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Discrimination  networks  are  data  structures  that  are  used  to  implement  rule 
condition  testing  in  active  databases.  Rules  allow  actions  to  be  automatically  undertaken 
(user  notification,  firing  of  another  rule,  etc.)  when  the  database  enters  a  particular  state 
following  a  data  insertion,  deletion  or  modification. 

Traditional  database  management  systems  handle  limited  (built-in)  sets  of 
datatypes  and  operators,  typically  numbers,  strings  of  characters,  money  and  datetime 
types,  and  basic  operators.  The  extensibility  feature  allows  knowledgeable  users  to  "plug- 
in"  newly  defined  datatypes  (such  as  polygons  and  time-series)  and  operators  on  those 
datatypes  (such  as  overlapping  of  polygons,  variations  in  the  time-series).  As  a  result,  the 
spectrum  of  use  of  database  management  systems  is  extended  to  many  more  types  of  user 
applications  (scientific,  business,  stock  exchange,  etc.). 


The  focus  of  this  dissertation  is  on  generating  and  optimizing  discrimination 
networks,  which  may  contain  user-defined  datatypes  and  operators  as  discussed  above. 

A  discrimination  network  uses  two  sets  of  operators  to  match  rule  conditions: 
selections  and  joins.  Traditionally,  selections  have  been  put  at  the  top  of  the  network  so 
that  they  are  computed  first;  the  reason  being  that  they  are  usually  cheap  and  simple 
operators  with  respect  to  joins.  While  that  assumption  might  hold  for  non-extensible 
systems,  user  defined  operators  (such  as  polygon  overlapping,  or  fingerprint  matching) 
can  be  expensive  and  complex  operators,  hence,  their  cost  is  no  longer  negligible  with 
respect  to  that  of  joins,  hi  such  cases,  pushing  a  selection  operator  down  the 
discrimination  network  (in  order  to  delay  its  evaluation  in  the  rule  condition)  may  prove 
to  represent  an  effective  cost  improvement  for  rule  matching. 

Query  optimization  techniques  have  been  proposed  to  handle  the  issue  of  Joins 
and  selection  predicate  ordering  in  query  trees.  This  dissertation  first  shows  how  the 
aforementioned  methods  do  not  necessarily  apply  to  discrimination  networks,  and  how  a 
direct  application  of  those  could  fail  to  produce  a  correct  ordering  of  joins  and  selections. 
We  proceed  by  proposing  a  set  of  strategies  (of  increasing  levels  of  complexity)  which 
can  be  applied  towards  the  placement  of  selection  predicates  in  a  discrimination  network, 
and  consider  the  corresponding  implementation  issues. 
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CHAPTER  1 

INTRODUCTION:  THE  TRIGGERMAN  PROJECT 
1  ■  1    Rules  and  Active  Databases  Overview: 

In  a  passive  (traditional)  database  management  system  (DBMS),  data  is 
manipulated  (created,  updated  and  deleted)  only  in  response  to  queries  issued  by  users  or 
user-level  applications.  On  the  other  hand,  a  system  fitted  with  active  capabilities 
(ADBMS)  has  the  ability  to  perform  such  operations  automatically.  More  generally,  any 
action  can  be  invoked  in  response  to  a  particular  event,  when  some  predefined 
condition(s)  is  (are)  true.  The  active  behavior  of  such  database  systems  is  expressed  by 
rules,  also  referred  to  as  triggers,  alerters,  or  monitors  in  the  hterature.  In  their  most 
popular  form,  active  rules  are  defined  with  a  threefold  structure:  Event-Condition- Action, 
very  often  referred  to  as  the  EC  A  model  in  the  literature  [Cha89,  Day96].  The  existence 
of  a  rule  processor  (rule-processing  module)  allows  for  a  number  of  tasks  to  be  carried 
out  by  active  database  systems  without  a  dedicated  application.  A  few  examples  of  such 
tasks  are  enumerated  in  the  following  three  sections. 

1.1.1  Integritv  Constraint  Checking  and  Repair: 

Issues  related  to  automating  integrity  constraints  checking  were  fairly  extensively 
investigated  and  are  a  classical  application  area  for  active  databases.  An  article  proposed 
by  Ceri  et  al  [Cer92]  describes  a  system  that  ensures  automatic  correction  of  violated 
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constraints  through  production  rules.  Additional  work  by  Ceri  and  Widom  [Cer90]  also 
proposes  the  use  of  active  rules  for  constraint  maintenance.  A  special  language  for 
defining  the  constraints  is  described,  as  well  as  a  framework  for  translating  those 
constraints  into  rules.  On  a  similar  note,  Etzion  [Etz91]  proposes  to  use  active  capabilities 
to  handle  incomplete  or  exceptional  information  in  a  database  system.  Fratemali  and 
Paraboschi's  paper  [Fra93]  is  also  a  good  overview  of  integrity  constraints  maintenance 
techniques,  mostly  based  on  first  order  logic.  In  another  research  effort,  Gertz  [Ger94] 
proposed  concentrating  on  the  repair  aspect  of  integrity  constraints,  rather  than  mere 
detection.  More  generally,  many  active  database  research  projects  did  undertake  this 
question  in  some  way  or  another:  the  Jasmine  prototype  [Ish93]  is  an  object-oriented 
(OO)  database  with  constraint  facilities,  able  to  handle  both  intra  and  inter-object 
constraints.  Ode  is  another  object-oriented  DBMS  developped  at  AT&T  Bell 
Laboratories.  A  paper  by  Jagadish  and  Qian  [Jag92]  introduces  integrity  maintenance  in 
Ode  and  focuses  on  inter-object  constraints  including  referential  integrity,  uniqueness 
integrity  and  relational  integrity.  POSTGRES  [Sto87]  and  PARDES  [Etz93a,  Etz93b, 
Etz94]  also  had  modules  and  related  publication  centered  on  this  idea. 

1.1.2  Time  and  Temporal  Issues: 

The  notions  of  events  and  time  are  tightly  related,  and  many  models  allow  the  user 
to  express  (with  more  or  less  simplicity)  temporal  or  time-sensitive  constraints. 
Consequently,  multiple  results  from  temporal  logic  and  temporal  database  research  have 
been  integrated  into  active  database  management  systems  (ADBMS)  models.  Dittrich  and 
Gatziu  [Dit93]  give  an  overview  of  the  useful  extensions  to  the  active  model  if  a  notion 
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of  time  is  supported.  In  a  stiort  paper  [Sis92],  Sistla  and  Wolfson  introduce  the  definition 
of  a  temporal  language  as  an  extension  to  complex  event  definition.  The  authors  discuss 
two  classes  of  temporal  operators:  "past  temporal  language"  and  "future  temporal 
language."  The  operators:  since,  last  time  and  previously  apply  to  the  accumulated  history 
up  to  the  time  when  the  trigger  was  introduced;  while  until,  next  time  and  eventually 
apply  to  incoming  events  from  the  time  when  the  trigger  was  defined.  The  same  authors 
present  some  additional  work  on  temporal  conditions  and  integrity  constraints  in  active 
database  systems  [Sis95a],  and  temporal  triggers  in  active  databases  [Sis95b].  Etzion,  Gal 
and  Segev  [Etz92]  also  discussed  issues  related  to  supporting  temporal  rules  in  the 
PARDES  project.  Dori,  Gal  and  Etzion  wrote  more  recently  on  temporal  active  databases 
as  a  key  to  computer  integrated  manufacturing  [Dor96].  The  POSTGRES  [Sto91]  project 
proposed  the  idea  of  time  travel,  which  allows  users  to  run  historical  queries.  This 
approach  was  implemented  using  a  "no-overwrite"  (also  referred  to  as  append-only) 
storage  management  technique. 

1.1.3  Materialized  View  Maintenance: 

The  importance  of  views  in  query  processing  is  undeniable.  Views  can  remain  as 
predicates  that  are  expanded  at  query  processing  time  (using  query  modification 
techniques  [Sto75]),  or  be  materialized  to  improve  query  response  time.  The  problem  of 
efficiently  and  automatically  maintaining  materialized  views  has  caught  the  interest  of  the 
database  research  community  for  several  years.  Rules  in  active  databases  have  proved  to 
be  of  great  assistance  in  the  process  of  automating  the  maintenance  of  materialized  views. 
The  work  carried  out  by  Ceri  and  Widom  [Cer90,  Cer91]  is  a  typical  application  of  that 
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concept:  The  user  defines  a  view  as  an  SQL  select  expression  and  the  system 
automatically  derives  production  rules  to  maintain  the  materialization  of  that  view.  Other 
examples  include  ideas  from  the  POSTGRES  project  [Sto87],  additional  work  by 
Stonebraker  et  al  [Sto90]  on  rules,  procedures  caching  and  views,  and  a  more  recent 
publication  by  Botzer  and  Etzion  [Bot96].  Furthermore,  the  fairly  recent  emergence  of 
data  warehousing  as  a  research  area  in  the  database  community  has  triggered  a  renewal  of 
interest  for  further  improvements  in  the  processes  of  maintenance  of  materialized  views. 

1.1.4  Advantages  and  Shortcomings  of  Existing  Rule  Systems: 

Because  rule  systems  have  proved  to  be  very  useful  in  business-oriented  as  well  as 
technical  applications,  mostly  from  a  software  engineering  aspect,  an  increasingly  larger 
number  of  (commercial)  database  management  systems  are  now  offering  built-in  support 
for  some  form  of  "trigger"  or  "rule"  definition  capabilities.  Indeed,  by  factoring  out  the 
non-trivial  and  non-easily  maintainable  procedural  coding  issues  of  many  applications, 
rule  systems  have  facilitated  application  development,  scalability  and  maintenance. 
However,  and  despite  those  multiple  advantages,  the  applicability  of  "on-board"  trigger 
systems  in  realistic  environments  is  limited  for  a  number  of  practical  reasons: 

Excessively  simplified  models: 

The  rule  execution  modules  are  often  relafively  too  simple  to  handle  realistic 
condition  testing  cases  because  a  trigger  can  only  be  defined  on  a  single  "table,"  (in 
relational  terminology)  or  data  source  more  generally.  The  definifion  of  a  rule  can 
potentially  be  as  complex  as  a  query  with  multiple  joins  and  selections  conditions. 
Systematically  precompiling  a  complex  rule  definition  as  an  SQL  query  to  be  run  against 
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the  database  at  the  occurrence  of  each  event  can  be  very  expensive  in  terms  of  systems 
resources  used. 

Lack  of  portability  and  flexibility: 

Built-in  trigger  systems  are  very  tightly  coupled  to  the  system  they  were  written 
for.  This  lack  of  modularity  in  the  design  of  active  capabilities  can  cause  important 
roadblocks  in  the  processes  of  application  migration  between  Database  Management 
Systems,  causing  lower  degrees  of  flexibility  and  portability. 

Synchronous  condition  testing: 

Despite  numerous  research  designs  aimed  at  describing  decoupled  or  deferred 
approaches  to  perform  rule  condition  testing  (POSTGRES  [Sto86],  HiPAC  [Cha89, 
Day96]),  many,  if  not  all,  of  the  existing  implementations  of  triggers  still  run  the 
condition  checking  synchronously,  as  part  of  the  user-level  transaction.  While  such  an 
approach  is  acceptable  in  environments  with  limited  numbers  of  rules  of  limited 
complexity,  it  does  not  scale  well  at  all  and  can  cause  significant  delays  in  commit  times, 
and  important  overhead  to  DBMS's  response  times.  A  simple  table  insert  operation  can 
suddenly  become  extremely  lengthy  and  expensive,  when  it  causes  the  firing  of  hundreds 
of  rules. 
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1.2   Extensibility  and  Extensible  Databases  Background: 

As  database  management  systems  gained  popularity  within  the  industry,  the  need 
for  user  defined  data  types,  structures  and  functions  became  more  significant.  Therefore,  a 
second  major  focus  of  the  research  community  in  the  past  few  years  (in  addition  to  adding 
active  capabilities  to  database  management  systems)  was  aimed  at  providing  better  and 
generalized  support  for  nontraditional  applications  as  they  were  becoming  more  popular 
among  users.  Examples  of  such  applications  include  Geographic  Information  Systems 
(GIS),  spatial  packages,  CAD  systems,  multimedia  and  others. 

Integrating  all  the  necessary  code  within  the  kernel  of  the  DBMS  makes  for  large 
monolithic  systems  that  still  could  be  lacking  some  specific  user  requirements.  Since  no 
single  database  system  could  integrate  built-in  support  for  all  possible  applications,  but 
most  of  those  applications  still  reused  an  important  set  of  common  functionalities, 
research  groups  looked  at  ways  of  adding  extensibility  features  to  conventional  database 
systems.  As  accurately  described  in  the  compilation  of  readings  in  database  systems  by 
Stonebraker  [Sto88],  there  are  several  levels  at  which  extensibility  can  be  achieved,  and 
the  corresponding  systems  may  be  classified  in  different  groups.  An  extensible  DBMS 
may  support: 

1.  New  data  types  (such  as  polygons), 

2.  New  operators  on  those  types  (such  as  overlapping), 

3.  New  aggregate  operators  (third  largest), 

4.  New  operators  on  relations  (such  as  the  transitive  closure), 

5.  New  access  methods  (such  as  R-Trees  operating  on  polygons). 
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6.  Modifications  to  the  data  model. 

In  the  most  common  form  of  extensibility  that  has  come  to  be  known  as  Abstract 
Data  Type  (ADT)  support,  we  refer  the  reader  to  projects  such  as  ADT-INGRES 
[Ong84],  RAD  [Osb86],  POSTGRES  [Sto86],  PROBE  [DaySV],  Starburst  [Haa90], 

2  2 

R  D  [Lin88]  and  Sabrina  [Gar89].  This  form  of  extensibility,  pioneered  by  Stonebraker 
et  al  [Sto83],  consists  of  providing  support  for  user  defined  data  types  (and  functions 
operating  on  those  types),  through  a  set  of  precisely  defined  interfaces  within  the  system. 
Using  ADT  support,  users  (programmers)  can  write  modules  defining  new  data  types 
(box,  polygon,  image,  etc.)  and  functions  operating  on  those  types  (intersect,  overlap, 
extract_contours)  and  "hook"  or  "plug"  those  modules  directly  into  the  DBMS 
(essentially  points  1,  2,  3  and  5).  All  the  projects  mentioned  above  adopted  this  model, 
and  differed  only  in  terms  of  basic  data  models,  query  languages  and  ease  of  adding 
extension  modules.  It  is  important  to  note  that  in  this  approach,  the  data  model  itself 
remains  fixed  (usually  referred  to  as  "object-relational"). 

A  more  ambitious  form  of  extensibility  (often  referred  to  as  a  "toolkit"  approach) 
is  that  where  the  data  model  itself  is  considered  as  variable  (points  4  and  6),  has  been 
investigated  in  projects  such  as  EXODUS  [Car86],  GENESIS  [Bat88]  and  DASDBS 
[Pau87,  Sch90].  The  database  kernel  provides  storage  management  and  transaction 
support.  The  application  specific  modules  and  DBMS  layers  are  implemented  "on  top"  of 
that  kernel.  The  "value  added  server"  capability  of  the  SHORE  [Car94]  project  (of  which 
PARADISE  [Dew93,  Dew94]  is  a  good  example)  constitutes  a  good  implementation  of 
this  concept,  carried  out  at  the  University  of  Wisconsin,  Madison.  Object-Oriented 
database  systems  such  as  ORION  [Ban87],  OS  AM  [Su93]  and  others  are  also  extensible 
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by  means  of  their  data  sublanguage  (ie  at  the  datatype  level).  Once  again,  we  refer  the 
interested  reader  to  [Sto88]  for  a  more  comprehensive  discussion  of  this  topic.  We  will 
simply  summarize  here  the  proposed  terminology  translation  table,  between  the  Object 
Oriented  and  the  relational  worlds:  A  class  represents  a  type,  and  an  object  is  an  instance 
of  the  type.  Furthermore,  a  method  and  a  message  represent  respectively  a  function  and  a 
function  call.  Since  extensibility  in  OODBs  relates  to  allowing  the  creation  of  new  classes 
and  new  corresponding  methods,  then  passing  messages  to  the  newly  created  objects,  it 
follows  that  it  is  also  an  ADT  type  of  extensibility,  centered  mainly  on  points  1  and  2. 

Even  though  the  concept  of  extensibility  using  ADT  support  is  less  ambitious  than 
the  "toolkit"  approach  is,  it  still  carries  far-reaching  consequences  for  the  architecture  of 
the  database  system.  As  a  matter  of  fact,  the  DBMS  design  needs  to  be  carefully  thought 
out  so  that  it  is  extensible  at  multiple  levels: 
Language  level: 

At  the  language  level,  the  user  should  be  able  to  declare  new  data  types  and 
functions  (methods,  operators)  and  use  them  in  appropriate  queries  (and  rules). 

Storage  management  and  access  methods  level: 

The  introduction  of  new  types  such  as  polygons  and  boxes  often  requires  the 
definition  of  new  access  methods  allowing  efficient  retrieval  and  indexing  of  objects  of 
such  types.  In  the  case  of  spatial  data,  an  example  of  a  new  access  method  could  be  R- 
Trees  [Gut84]. 


Quen/  optimizer  level: 

The  query  optimizer  is  usually  a  very  complex  component  of  the  database 
management  system  that  is  "closed"  (with  built-in  logic).  The  extensibility  at  the  storage 
management  level  discussed  above  is  totally  useless  if  the  query  optimizer  can  not  be 
made  aware  of  the  existence  of  the  new  access  method,  and  make  proper  use  of  it  while 
generating  plans.  Thus,  the  query  optimizer  architecture  needs  to  be  designed  in  an  open 
format  (sometimes  referred  to  as  "table-driven")  allowing  the  addition  of  new  access 
methods  or  functions  with  their  associated  costs. 

1.3  TriggerMan: 

The  TriggerMan  [Han97a]  project  (under  development  at  the  Database  Research 
Center  at  the  University  of  Florida)  implements  an  asynchronous  and  extensible  trigger 
processor  system.  Many  of  the  design  features  and  concepts  adopted  for  TriggerMan  were 
meant  to  handle  or  reduce  the  significant  shortcomings  of  "traditional"  rule  processors  as 
they  were  described  earlier  in  section  1.1.4,  and  integrate  the  ADT  based  extensibility 
technology  described  in  section  1.2  above.  The  reader  will  find  below  a  brief  description 
of  the  major  design  decisions  that  went  into  TriggerMan. 

Asynchronous  condition  testing: 

TriggerMan  is  an  asynchronous,  external  ("outboard")  trigger  processor,  which 
has  been  designed  to  cooperate  with  various  heterogeneous  data  sources.  Data  sources 
can  be  any  combination  of  legacy  systems,  relational,  object-relational  or  object  oriented 
DBMSs,  simple  unstructured  data  sets  or  applications.  Updates  are  forwarded  to 
TriggerMan  (for  rule  condition  testing)  after  they  have  committed  within  their 
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corresponding  data  source,  hence  the  "asynchronous"  designation.  The  forwarding  of 
updates  is  performed  by  special  data  sources  applications  providing  standard  interfaces 
between  TriggerMan  and  the  data  sources.  From  a  scalability  point  of  view,  this  approach 
is  expected  to  perform  much  better  than  the  (mainstream)  synchronous  one  for  large  sets 
of  rules.  Furthermore,  the  external  aspect  of  TriggerMan  will  allow  it  to  cooperate  with 
multiple  heterogeneous  data  sources  more  easily  than  built-in  trigger  processors  can. 

Use  of  Gator  networks: 

TriggerMan  uses  a  data  structure  called  a  "Gator  network"  to  perform  the  rule 
condition  testing.  The  Gator  network  is  a  generalization  of  the  earlier  Rete  [For82]  and 
TREAT  [Mir87]  discrimination  networks,  and  was  developed  at  the  University  of  Florida 
as  part  of  the  Ariel  project  [Han92,  Han96a].  Discrimination  networks  are  data  structures 
borrowed  from  AI  production  rule  systems  such  as  OPS5  [Bro85].  The  use  of 
discrimination  networks  in  TriggerMan  is  expected  to  overcome  the  simplistic  models  of 
many  trigger  processors,  and  allow  efficient  monitoring  of  sophisticated  conditions  based 
on  multiple  joins,  without  the  added  overhead  associated  with  SQL  query  compilation 
and  execution. 

Temporal  component: 

TriggerMan  incorporates  a  highly  expressive  and  non-procedural  (SQL  like) 
temporal  language  for  rule  definition.  The  user  is  provided  with  an  initial  set  of  built-in 
temporal  functions  (increase(),  decrease()  etc.)  and  completeness  is  achieved  through  the 
extensibility  feature,  which  enables  users  to  write  and  plug-in  a  new  temporal  function 
the  same  way  new  data  types  and  methods  are  [Alfa98]. 
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The  architecture  of  the  environment  in  which  TriggerMan  operates,  as  well  as  a 
brief  overview  of  the  decisions  that  were  made  regarding  the  implementation  issues,  are 
covered  next. 

1.3.1  The  TriggerMan  Environment: 

As  described  in  Figure  1-1  below,  the  architecture  of  the  TriggerMan  environment 
consist  of  one  or  more  external  data  sources  "feeding"  the  Asynchronous  Trigger 
Processor  with  data  consisting  of  "update  descriptors".  Those  external  data  sources  can  be 
of  multiple  types: 

-  An  application  generating  real  time  data  (stock  exchange  ticker  etc.). 

-  A  replication  server  reading  data  from  a  DBMS  (such  as  Sybase  replication  server 
[Syb96]). 

-  A  DBMS  application  that  uses  simple  built-in  triggers  to  catch  local  updates,  and 
forwards  them. 

-  A  non-structured  data  source  such  as  a  spreadsheet  or  a  word  processor  file. 

A  "data  source  program"  application  will  act  as  an  interface  between  TriggerMan 
and  the  external  data  sources.  Depending  on  the  inherent  characteristics  of  those  external 
data  sources  (structured,  queriable,  active  etc.),  the  role  and  complexity  of  the  data  source 
program  will  vary  on  a  per-case  basis.  Hence,  a  TriggerMan  data  source  application  may 
range  from  a  process  with  advanced  logic,  regularly  performing  differencing  on  a  passive 
non-structured,  non  queriable  source,  to  a  much  simpler  "forwarding"  layer  (Figure  1-1). 
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Figure  1 


-1  The  architecture  of  the  TriggerMan  environment. 
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In  addition  to  providing  TriggerMan  with  data  (update  descriptors)  for  rule 
condition  testing,  data  source  applications  may  perform  some  processing  on  the  formats 
of  those  update  descriptors  in  order  to  convert  them  to  a  format  "understood"  by 
TriggerMan  when  (and  if)  necessary.  The  conversion  will  be  performed  by  a  "translator" 
or  "translation  layer"  reading  the  external  data  source  format  and  calling  a  precisely 
defined  API,  in  order  to  generate  the  tokens  in  the  internal  format  required  by  TriggerMan 
as  illustrated  in  Figure  1-1.  A  number  of  client  applications  may  then  register  with 
TriggerMan  for  particular  event  notifications  that  can  be  raised  as  part  of  rule  actions. 
When  events  are  raised,  each  application  that  registered  for  it  is  notified.  A  console 
application  is  used  as  the  "administrative"  interface  to  the  system,  allowing  startups, 
shutdowns,  activation,  deactivation  of  triggers  and  other  administrative  tasks. 

1.3.2  The  TriggerMan  Physical  Architecture: 

The  design  team  was  faced  with  a  few  possible  implementation  strategies  for  the 
first  prototype  of  TriggerMan.  Each  of  the  envisaged  approaches  was  fully  compatible 
with  the  logical  architecture  described  in  section  1.3.1,  but  the  required  amounts  of 
implementation  and  coding,  and  the  capabilities  and  speed  of  the  resulting  system  varied 
substantially  depending  on  the  chosen  scheme. 

Stand-alone  version  with  local  storage  management: 

In  this  strategy,  TriggerMan  is  to  be  implemented  as  a  stand-alone  application, 
using  a  local  storage  manager  supporting  transactions  and  recovery  management. 
Parallelism  support  being  an  integral  part  of  the  design  of  the  project,  the  overall  physical 
architecture  is  described  in  Figure  1-2  below: 
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Figure  1-2  The  architecture  of  TriggerMan 
In  an  ideal  parallel  configuration  of  this  architecture,  a  copy  of  the  TriggerMan 
server  runs  on  each  node  of  a  shared-nothing  architecture.  The  nodes  of  the  shared- 
nothing  environment  are  connected  with  a  high-speed  medium  (bus  or  fast  network)  and 
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each  of  them  can  either  be  a  single  processor  machine,  or  an  SMP  cluster  grouping  a 
number  of  processors  sharing  a  local  memory  area.  A  Virtual  Processor  (VProc) 
abstraction  concept,  embedded  within  TriggerMan  will  map  VProcs  to  real  processors  in 
a  flexible  way,  allowing  high  fault  tolerance  and  dynamic  load  sharing  capabilities  within 
the  server.  Each  VProc  runs  a  pool  of  threads  handling  various  tasks  including: 

-  Local  rule  condition  testing. 

-  Tokens  exchanges  with  other  VProcs  according  to  the  data-partitioning  scheme. 

-  Possible  local  basic  storage  management. 

-  Caching  strategies  of  the  local  partitions  of  the  memory  nodes. 

Although  this  approach  would  have  represented  the  ideal  implementation,  given 
the  unavailability  of  an  open  and  reliable  storage  management  layer,  and  considering  the 
tremendous  amount  of  development  effort  required  to  write  one,  it  was  decided  not  to  try 
to  utilize  this  architecture. 

Standalone  version  with  external  storage  management: 

This  approach  is  very  similar  to  the  previous  one  and  differs  only  in  the  storage 
management  strategy.  Instead  of  relying  on  its  own  storage  manager,  TriggerMan  would 
use  a  commercial  DBMS  for  storage  purposes,  using  ODBC  or  any  other  message  passing 
interface  which  could  be  made  available.  While  this  plan  of  action  was  technically 
feasible,  it  suffered  from  serious  performance  limitations  in  handling  large  amounts  of 
data  through  the  message  passing  type  of  communication.  Furthermore,  a  few  additional 
technical  problems  would  have  increased  the  internal  complexity  of  TriggerMan  without 
any  major  benefit.  It  was  therefore  decided  to  put  this  scheme  aside. 
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Plug-in  within  an  extensible  DBMS: 

A  middle  ground  strategy,  procuring  a  reliable  storage  management  layer  for 
TriggerMan,  without  extensive  code  writing  (or  excessive  communication  overhead),  was 
to  develop  the  project  as  an  extension  to  an  object-relational  DBMS  with  ADT  support. 
The  advantages  of  this  strategy  are  multiple  and  can  be  summarized  as  follows: 

-  As  a  plug-in  module  running  in  the  same  address  space  as  the  DBMS,  TriggerMan 
would  benefit  from  "commercial  strength"  local  storage  management,  without  the 
overhead  associated  with  message  passing. 

-  Any  support  of  parallelism  within  the  DBMS  is  freely  available  to  TriggerMan 
through  the  transparent  SQL  "call-back"  interface. 
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Figure  1-3  TriggerMan  as  an  Informix  Datablade 
Figure  1-3  above  describes  the  implementation  strategy  that  has  been  chosen  to 
develop  the  first  prototype  of  TriggerMan,  as  an  Informix  Dynamic  Server  with  Universal 
Data  Option  (IDSAJDO)  Datablade.  The  TriggerMan  catalogs  and  rules  are  stored  within 
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the  Informix  server  in  regular  "tables".  The  access  to  those  elements  is  made  through 
standard  SQL  calls  made  to  IDS/UDO. 

Because  of  technical  implementation  issues  related  to  Informix  threads  handling 
in  datablades,  it  was  decided  to  keep  the  TriggerMan  code  passive,  and  signal  it  on  a 
regular  basis  through  an  external  simple  timer  application:  the  TriggerMan  driver. 


CHAPTER  2 
PROBLEM  STATEMENT 


As  mentioned  in  the  introduction,  TriggerMan  relies  on  Gator  networks  to 
perform  rule  condition  testing  within  the  context  of  an  Object-Relational  Database 
Management  system  and  environment.  In  this  chapter,  the  reader  will  find  a  brief  tutorial 
of  discrimination  networks,  and  a  precise  description  of  the  specific  issues  that  this 
dissertation  will  cover. 

2. 1   A  Discrimination  Network  Discussion  and  Example: 

For  the  reader  who  is  unfamiliar  with  discrimination  (and  Gator)  networks,  this 
section  is  meant  to  present  a  brief  tutorial  of  these  data  structures  through  a  general 
discussion  of  their  characteristics.  A  simple  example  presented  below  will  be  referred  to 
throughout  the  discussion  for  illustration  purposes.  The  reader  who  is  already  familiar 
with  discrimination  networks  may  skip  this  section. 
Consider  a  simple  two-table  schema: 
EMPLOYEE  (SSN#,  Name,  Age,  Salary,  Dno) 
DEPARTMENT  (Dno,  Dname) 

And  a  rule  condition  defined  on  the  schema  above  as  follows: 

"If  the  record  of  a  Sales  department  employee  with  a  salary  higher  than  $100,000  is 
accessed  (inserted  or  modified),  then  notify  Bob,  the  manager  of  the  department." 
This  condition  could  be  defined  in  a  rule  language  as  follows: 
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Create  Trigger  HighSalesSalary 

From  EMPLOYEE  as  EMP,  DEPARTMENT  as  DEP 

When  EMRSalary  >=  100000 

And  EMP.Dno  =  DEP.Dno 

And  DEP.Dname  =  'Sales' 

Then  raiseEvent  HighSalesSalaryUpdate("Bob",  EMP.Dno). 


Just  as  queries  can  be  represented  as  graphs  referred  to  as  "query  graphs,"  rules  can  also 
be  represented  in  a  similar  way:  the  rule  graph.  A  rule  processor  or  active  database  will 
then  internally  represent  the  trigger  HighSalesSalary  defined  above,  in  the  format  of  a 
graph  such  as  the  one  drawn  in  Figure  2-1  below: 


Where: 

-  EMP  is  a  tuple  variable  that  represents  the  table  EMPLOYEE 

-  DEP  is  a  tuple  variable  that  represents  the  table  DEPARTMENT 

-  CTi  represents  the  selection  "Salary  >  100000" 

-  02  represents  the  selection  Dname  =  'Sales' 

-  The  horizontal  edge  between  EMP  and  DEP  represents  the  join  condition  written  just 


Ci:  Salary  >  100,000 


ay.  Dname  =  Sales 


Figure  2-1  Rule  Graph  for  Rule  HighSalesSalary 


above  it. 
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A  corresponding  discrimination  network  would  be  that  given  in  Figure  2-2  below: 


Figure  2-2  Discrimination  network  for  rule  HighSalesSalary 

In  Figure  2-2: 

tti  is  derived  from  tuple  variable  EMP  from  the  rule  graph  in  Figure  2-1 . 

a2  is  derived  from  tuple  variable  DEP  from  the  rule  graph  in  Figure  2-1. 

Oi  and  (72  have  the  same  meanings  as  in  the  rule  graph  from  Figure  2-1 . 

-  The  edge  drawn  between  a\  and  ai  represents  the  join  condition  between  EMP  and 
DEP  from  the  rule  graph.  By  convention,  join  conditions  will  be  represented  as 
dashed  lines  joining  sibling  nodes.  Sibling  nodes  are  always  at  the  same  distance 
from  the  Pnode. 

-  The  full  arrows  represent  a  child  ->  parent  relationship.  The  parent  node  stores  the 
results  of  the  joins  across  all  of  its  children.  The  arrows  have  been  drawn  here  for 
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illustration  purposes,  but  typically,  simple  lines  or  segments  joining  the  child  to  its 
parent  will  replace  them. 

The  routing  node  and  the  dashed  arrows  are  also  drawn  here  for  illustration  and 
tutorial  purposes.  The  dashed  arrows  represent  the  routing  capabilities  of  the  routing 
node  to  send  incoming  data  to  the  right  alpha  node.  The  presence  of  the  routing  node 
and  the  dashed  arrows  will  be  implicit  in  all  further  discrimination  and  Gator 
networks  figures. 

General  structure  of  a  discrimination  network: 

In  general,  a  discrimination  network  can  contain  the  following  types  of  memory ^ 

nodes: 

-  A  Pnode,  which  is  the  actual  root  of  the  tree  representing  the  discrimination  network. 
As  we  will  see  very  shortly,  data  (tokens)  are  propagated  through  the  network,  and 
the  semantics  of  discrimination  networks  is  such  that  when  a  token  reaches  the 
Pnode,  all  the  rule  conditions  have  been  met  and  the  trigger  fires. 

-  A  set  of  alpha  nodes,  one  for  each  tuple  variable  from  the  query  graph  of  the  rule. 

-  A  set  of  beta  nodes.  Beta  nodes  are  intermediate  nodes  used  to  store  the  results  of 
two  or  more  alphas  and/or  betas.  (Since  the  current  example  has  only  one  join 
condition,  it  does  not  contain  any  beta  node.  A  more  general  network  with  beta  nodes 
will  be  presented  at  the  end  of  this  discussion). 

-  The  "Routing"  node  was  sometimes  called  a  "Root"  node  in  earlier  papers,  but  it  is 
not  to  be  confused  with  the  Pnode  of  the  network.  The  "Routing"  node  redirects 

^  Despite  their  denomination,  memory  nodes  can  be  (and  often  are)  stored  on  disk. 
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incoming  tokens  to  the  appropriate  a  nodes  for  further  processing.  We  will  therefore 

use  the  term  "Root"  only  to  refer  to  the  Pnode. 

Now  that  the  structure  of  discrimination  networks  has  been  described,  it  is 
necessary  to  present  the  dynamic  aspect  of  these  structures,  represented  by  the  token 
propagation  that  was  briefly  mentioned  a  few  lines  above. 

Token  propagation  in  discrimination  networks: 

The  process  of  testing  a  rule  condition  when  an  update  occurs  (in  the  database)  is 
sequenced  in  the  following  way:  each  access  to  a  relation  appearing  in  a  rule  condition 
will  generate  the  creation  of  an  update  descriptor.  An  update  descriptor  (also  referred  to 
as  a  token)  is  simply  the  tuple  that  was  accessed,  to  which  a  few  auxiliary  fields  are 
added  for  rule  processing  purposes.  The  token  in  question  will  be  transferred  to  the 
Routing  node  of  the  discrimination  network(s)  representing  the  rules  defined  on  the 
current  schema.  Based  on  a  table  (or  data  source)  field  in  the  header  of  the  token,  it  will 
be  forwarded  to  the  appropriate  a  node  and  propagated  down  the  network  from  there  (for 
testing  the  rule  condition).  Consider  again  the  example  of  the  schema  described  at  the 
beginning  of  this  section;  each  access  to  relations  EMPLOYEE  or  DEPARTMENT  will 
therefore  generate  the  creation  of  an  update  descriptor  that  will  be  transferred  to  the 
Routing  node  described  in  Figure  2-2  above. 
Suppose  that  table  DEPARTMENT  is  the  following: 


Dno 

Dname 

1 

Finance 

2 

Sales 

3 

Research  &  Development 
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And  a  tuple  (SSN  =  1,  Name  =  "Bob",  Age  =  30,  Salary  =  150000,  Dno  =  2)  was  inserted 
in  table  EMPLOYEE.  An  update  descriptor  of  the  form  [EMPLOYEE,  INSERT,  1, 
"Bob",  30,  150  000,  2]  is  generated  and  sent  to  the  routing  node  of  the  discrimination 
network.  The  same  token  is  then  forwarded  to  ai  based  on  the  EMPLOYEE  field.  The 
field  INSERT  is  used  to  determine  the  type  of  database  operation  that  was  performed, 
since  that  can  be  part  of  the  rule  condition  (on  INSERT,  on  DELETE  etc.),  and  filtered 
out  by  the  routing  node  if  necessary.  The  selection  predicate  represented  by  <j\  (salary  > 
100000)  is  then  tested  and  succeeds.  The  update  descriptor  is  therefore  inserted  into  the 
node  tti  where  the  join  condition  with  the  departments  is  checked.  Since  the  current 
department  is  "Sales,"  and  the  Dno  for  Sales  is  "2,"  the  join  succeeds  too  and  a  resulting 
compound  token  (consisting  of  the  concatenation  of  both  tokens  from  ai  and  ai)  is 
created  and  inserted  into  the  Pnode.  As  mentioned  earlier,  the  insertion  of  a  new  token 
into  the  Pnode  signals  the  matching  of  the  rule  condition  for  the  current  database  access. 
The  network  used  for  this  particular  example  was  simple  enough  not  to  require  any  beta 
node  to  store  intermediate  joins  results.  In  general,  rules  with  three  or  more  tuple 
variables  will  generate  networks  with  three  or  more  alpha  nodes,  and  the  presence  of  beta 
nodes  will  become  likely.  Consider  the  network  below: 


Pnode 
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In  this  case,  the  rule  has  six  tuple  variables  (six  alpha  nodes)  and  three  selection 
predicates.  When  the  network  is  first  created,  it  is  "primed,"  which  means  that  all  nodes 
are  loaded  with  their  respective  data.  Alphas  are  primed  from  the  relations  they  are 
derived  from  (with  the  selection  condition  applied),  and  betas  are  primed  by  executing  the 
join  operation  across  all  their  children.  (The  Pnode  is  always  emptied  after  each  rule 
condition  testing  and  does  not  require  priming).  The  results  of  the  joins  between  the  first 
three  alphas  are  stored  in  Pi  while  P2  stores  the  results  of  the  joins  between  04  and  aj. 
The  Pnode  contains  the  result  of  joins  between  Pi,  P2  and  Og.  The  propagation  of  tokens 
is  applied  in  the  same  way  it  was  described  earlier.  Each  time  a  token  is  inserted  into  a 
node,  it  is  joined  with  all  the  siblings  of  that  node  (according  to  a  predefined  ordering  of 
those)  and  the  result  of  that  join  (potentially  many  new  "compound  tokens"),  is  inserted 
into  the  parent  node  and  so  on.  As  mentioned  earlier,  a  compound  token  is  simply  the 
concatenation  of  the  tokens  that  joined  from  each  of  the  sibling  nodes.  As  an  example, 
consider  an  update  descriptor  originating  from  the  relation  from  which  a2  is  derived.  The 
routing  node  (not  drawn)  routes  the  token  to  a2,  which  causes  Cz  to  be  tested  first.  If  the 
test  succeeds,  the  token  is  inserted  into  az  and  joined  with  ttj  and  (the  joining  order  is 
part  of  the  optimization  algorithm  and  is  hard  coded  within  the  "join  plan"  of  each  node). 
All  the  tokens  generated  by  this  join  are  inserted  into  Pi  and  joined  with  P2  and  OLe.  Any 
resulting  token  is  inserted  into  the  Pnode  and  causes  the  rule  to  fire.  Note  that  all  tokens 
inserted  into  the  Pnode  do  not  remain  there,  but  are  deleted  when  the  rule  executes.  Those 
tokens  are  needed  in  the  Pnode  for  condition/action  binding  purposes,  when  the  action  of 
a  rule  refers  to  the  data  that  had  caused  it  to  fire. 
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2.2   Need  for  Selection  Predicate  Pushdown: 

In  the  extensible  and  object-relational  environment  in  which  TriggerMan  operates, 
selection  predicates  can  be  arbitrary  user-defined  functions  or  methods.  Suppose  that  the 
table  EMPLOYEE  had  one  additional  field  "fingerprint"  of  type  "image"  or  "vector 
data."  Suppose  furthermore  that  the  (selection)  predicate  C\  is  no  longer  a  condition  on 
the  salary,  but  a  complex  operation  executed  by  one  of  the  image  (vector  data)  data  type 
methods  (such  as  matching  fingerprints  etc.).  The  computation  of  a\  now  becomes 
expensive  enough  for  the  cost  of  the  join  operation  tti.Dno  =  aa.Dno  to  become 
negligible  in  comparison  with  it.  In  such  a  case,  it  would  probably  be  a  good  optimization 
strategy  to  perform  the  join  operation  first,  and  delay  the  selection  evaluation  as  much  as 
possible.  A  "Push  Down"  operation  is  then  performed  on  d,  which  is  replaced  by  a 
"True"  operator  at  the  leaf  level,  in  order  to  allow  all  tokens  to  get  through.  The  resulting 
network  is  shown  in  Figure  2-3  below: 


Figure  2-3  Selection  pushdown 
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Indeed,  the  potential  reduction  of  the  number  of  tokens  flowing  through  C\  represents  in 
turn  a  potential  optimization  of  this  network  with  respect  to  the  one  described  earlier  in 
Figure  2-2. 
Remark: 

Since  Gator  networks  are  always  represented  with  the  root  node  (Pnode)  at  the 
bottom  and  the  leaves  (alpha  nodes)  at  the  top,  a  selection  pushdown  means  moving  the 
sigma  node  so  that  it  becomes  closer  to  the  root  of  the  network.  Since  query  trees  are 
usually  represented  the  other  way  around,  with  their  root  at  the  top,  the  pushdown  and 
pull-up  operations  in  query  optimization  [Hel92,  Hel93]  have  opposite  meanings  to  those 
used  in  discrimination  network  optimization  strategies. 

All  Gator  networks  designs  and  implementation  have  traditionally  placed 
selection  predicates  at  their  leaves  so  that  they  can  be  performed  first.  While  this 
approach  proved  worthwhile  and  effective  in  existing  optimization  strategies,  the  ability 
of  TriggerMan  to  operate  in  extensible  environments  makes  this  heuristic  no  longer  true. 
As  a  matter  of  fact,  selection  predicates  defined  on  objects  such  as  polygons  or  images 
may  prove  to  be  more  expensive  than  (some)  joins,  and  a  more  elaborate  heuristic  (or  set 
of  heuristics)  for  selection  placement  in  the  discrimination  network  will  become 
necessary. 

The  intention  of  this  dissertation  is  to  investigate  the  approaches  that  can  be  taken 
towards  achieving  an  optimal  or  near  optimal  placement  of  selection  predicates  in 
discrimination  networks  in  general,  and  more  particularly  in  Gator  networks.  The 
organization  of  the  rest  of  the  chapters  of  this  work  is  structured  as  follows: 
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In  Chapter  3,  the  reader  will  be  familiarized  with  the  most  important  aspects  of  the 
existing  Gator  network  optimizer  as  it  is  implemented  in  the  Ariel  project  [Han92]. 
Chapter  4  will  cover  the  related  work  that  was  accomplished  in  selection  predicate 
placement  in  query  processing,  and  show  how  those  results  cannot  be  applied  to 
discrimination  networks  optimization  strategies.  In  Chapter  5,  we  will  propose  an 
alternative  set  of  strategies  extending  the  existent  Gator  networks  optimizer  that  was 
described  in  Chapter  3,  and  analyze  the  consequences  of  each  of  them  (on  the  current 
optimizer)  mainly  from  a  conceptual  point  of  view.  Chapter  6  will  describe  the  practical 
issues  that  arise  during  the  implementation  of  the  strategies  described  in  Chapter  5,  and 
present  the  practical  results  that  were  obtained.  In  Chapter  7,  we  will  conclude  with  a 
brief  review  of  this  work  and  present  additional  related  issues  and  open  research 
questions. 


CHAPTER  3 

BACKGROUND  ON  GATOR  NETWORK  OPTIMIZATION 

TREAT  [Mir87]  and  Rete  [For82]  networks  have  been  successful  approaches  for 
performing  non-procedural  rule  matching  in  active  databases.  Treat  networks  do  not 
make  use  of  the  intermediate  beta  nodes  described  earlier,  and  have  all  their  alpha 
memory  nodes  feed  directly  into  the  Pnode.  Therefore,  Treat  networks  are  always  two- 
level  networks  (The  HighSalesSalary  network  shown  above  (Figure  2-2)  was  a  TREAT 
network  with  a  single  join  condition).  The  Rete  network  uses  betas  that  can  store  the 
results  of  intermediate  two-way  joins  only.  In  other  words,  a  beta  node  never  has  more 
than  two  children  in  a  Rete  network. 

Gator  networks  extend  the  shapes  of  discrimination  networks  to  general  shapes  by 
allowing  betas  to  have  N  other  nodes  as  direct  children.  As  a  result,  TREAT  and  Rete 
become  special  cases  of  the  more  general  Gator  structure. 

Because  of  its  generalized  shape,  the  optimization  of  a  Gator  network  becomes  an 
expensive  procedure  with  the  increase  of  the  number  of  tuple  variables  involved  in  the 
query  graph.  As  a  matter  of  fact,  it  is  easy  to  see  that  the  number  of  alternatives  for 
building  a  Gator  network  increases  extremely  rapidly  with  the  number  of  alpha  nodes. 
Since  our  approach  consists  of  an  extension  of  the  existing  cost  model,  we  will  present  a 
brief  review  of  the  current  Gator  network  optimization  strategy  before  presenting  the 
selection  predicate  placement  strategy  itself. 
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3. 1    Brief  Review  of  the  Gator  Network  Optimization  Algorithm: 

Randomized  algorithms  have  been  noticed  to  produce  very  acceptable  solutions 
(nearly  optimal)  with  computation  or  execution  times  that  are  orders  of  magnitude 
smaller  than  dynamic  programming  approaches,  when  dealing  with  exponential  search 
spaces.  Such  algorithms  include: 
Simulated  Annealing  [Ioa87], 

-  Iterative  hnprovement  [Swa88], 

-  Two-phase  Optimization,  which  combines  the  two  previous  algorithms. 

The  existing  implementation  in  the  Ariel[Han89,  Han96,  Has93]  project  includes 
the  use  of  such  algorithms  for  the  generation  of  Gator  networks.  The  reader  will 
undoubtedly  notice  that  the  used  approach  is  a  greedy  algorithm,  starting  with  an  initial 
solution  and  trying  to  improve  it  at  each  step  of  the  optimization  process.  The  solution  is 
of  course  prone  to  the  local  minimum  problem  common  to  all  greedy  algorithms,  but  the 
use  of  simulated  annealing  and  similar  techniques  (such  as  using  multiple  starting  points) 
are  meant  to  reduce,  if  not  eliminate,  this  problem. 

The  remaining  part  of  this  chapter  gives  a  general  overview  of  the  existing 
implementation  of  Gator  network  optimizers  carried  out  at  the  University  of  Florida. 
Below  we  describe  a  typical  randomized  algorithm  (As  a  function  named 
GetOptimalGator)  described  in  high-level  pseudo  code,  then  we  give  a  brief  overview  of 
the  proposed  cost  model. 
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Gator*  GetOptimalGator  (RuleGraph*  Graph) 


Begin 


BestSolution  =  BuildlnitialGatorNetwork(Graph) 
While  (no  satisfying  solution  found) 

New-Gator  =  ApplyLocalPerturbation  (BestSolution) 
If  (Cost  (New-Gator)  <  Cost  (BestSolution) )  then 
DeleteGator  (BestSolution) 
BestSolution  =  New-Gator 

ilse 

DeleteGator  (New-Gator) 

Endif 
EndWhile 

Return  (BestSolution) 


End 


The  exit  condition  from  the  While  loop  described  as  "no  satisfying  solution  found"  can  be 
implemented  in  different  ways  according  to  the  chosen  solution  used  for  the  randomized 
approach.  Possible  conditions  for  a  satisfying  solution  include: 
Perform  a  specified  number  of  iterations  then  exit. 

-  Iterate  until  the  cost  drops  below  some  predefined  threshold  then  exit. 

-  Iterate  until  a  certain  number  of  operators  called  successively  fail  to  improve  the 
network. 

A  combination  of  both  previous  cases:  iterate  until  a  specified  number  of  iterations 
are  reached  or  the  cost  drops  below  a  certain  limit,  whichever  comes  first. 

The  next  algorithm  gives  a  description  of  the  function  BuildlnitialGatorNetwork 
that  is  called  at  the  beginning  of  GetOptimalGator  described  above. 
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Gator*  BuildlnitialGatorNetwork  (RuleGraph*  Graph) 
Begin 

LocalList  =  New  List 
NeighborsList  =  New  List 

For  (each  tuple  variable  in  Graph) 
AlphaNode  =  new  Alpha 
If  (tuple  variable  has  a  a  with  selectivity  s) 
AlphaNode^  selectivity  =  s 

Endif 

LocalList  ^Insert  (AlphaNode) 

EndFor 

While  (LocalList  not  empty) 

Node  1  =  Randomly  select  one  element  from  LocalList 
NeighborsList  =  FindNeighbors  (Nodel,  LocalList) 
k  =  Size  of  (NeighborsList) 

Node2  =  Randomly  select  an  element  from  NeighborsList 
BetaNode  =  Combine  (Nodel,  Node2) 
LocalList  ->  remove  (Nodel) 
LocalList     remove  (Node2) 
LocalList     insert  (BetaNode) 
EndWhile 

End 

The  function  FindNeighbors  simply  walks  down  the  list  of  nodes  and  returns  all  the 
nodes  that  have  a  direct  edge  connection  with  the  node  currently  passed  as  a  parameter. 

As  the  reader  may  have  already  noticed,  the  Gator  network  is  incrementally  built 
using  the  Combine(Node*  Node*)  function.  This  function  implements  a  set  of  three  basic 
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combination  operators  for  nodes,  whicii  incrementally  return  larger  networks.  Below  is  a 
description  of  the  operators  implemented  in  the  function  Combine. 
3.1.1  Combine  Operators: 

The  three  operators,  Join  Absorb  and  Merge  are  shown  in  this  section. 

Join: 

Two  existing  nodes  are  made  children  of  a  newly  created  one  (beta),  the  new  network  is 
one  level  deeper. 


Absorb; 

The  white  node  is  absorbed  and  becomes  child  of  an  existing  P  node. 


Merge: 

Merge  applies  only  to  two  P  nodes. 
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Once  the  initial  network  is  randomly  built  using  the  operators  above,  it  is  improved 
gradually  by  using  a  set  of  three  local  change  operators  "Create  Beta,"  "Kill  Beta"  and 
"Merge  Sibling"  described  in  the  following  section. 
3.1.2  Local  Change  Operators: 

Gator*  ApplyLocalPerturbation  (Gator*  Current) 


Begin 


End 


Perturbation  =  Choose  Randomly  In  {CreateBeta,  KillBeta,  MergeSibling} 
New-Gator  =  ApplyPerturbation  (Current,  Perturbation) 
Return  (New-Gator) 


Gator*  ApplyPerturbation  (Gator*  Current,  char*  perturbation) 

A  set  of  three  local  perturbation  operators  has  been  specified.  At  each  iteration,  one  of  the 

operators  is  randomly  chosen  and  applied  to  the  network.  Below  is  a  description  of  those 

local  operators. 

Create  Beta: 


tt''   "a     ,a  ,a 


Pnode 


Pnode 


Figure  3-1  Local  change  operator  "Create  Beta" 
The  Create  Beta  operator  requires  a  node  with  at  least  three  children.  In  the 
example  above,  the  node  in  question  is  the  Pnode  itself,  but  it  can  potentially  be  any  P  in 
the  network  or  the  Pnode. 
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Kill  Beta: 


Pnode 


Pnode 


Figure  3-2  Local  change  operator  "Kill  Beta" 
The  beta  is  killed,  and  all  its  children  become  direct  children  of  the  (old)  parent  of  that 
beta.  In  the  example  above,  all  four  alphas  become  children  of  the  Pnode. 
Merge  Sibling: 


Pnode 

Figure  3-3  Local  change  operator  "Merge  Sibling 


Pnode 


The  merge  sibling  operator  requires  a  node  to  have  at  least  three  children  of  which  at  least 
one  is  a  P  (in  the  example  above,  the  Pnode  has  been  selected  again  as  such  a  node  but 
that  is  not  a  necessity).  In  the  case  where  more  than  one  (3  node  is  available,  one  of  them 
is  randomly  chosen  as  a  "MergeNode."  One  of  the  siblings  of  the  "MergeNode"  is  then 
selected  and  moved  to  a  child  position  with  respect  to  "MergeNode"  as  shown  in  Figure 
3-3  above. 
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3.2   The  Cost  Model: 

The  comprehensive  description  of  the  Gator  discrimination  network  cost  model  is 
somewhat  lengthy  and  will  not  be  covered  in  detail.  A  complete  reference  can  be  found 
in  [Han96b].  In  this  section,  the  reader  will  find  a  short  overview  of  the  cost  model, 
reviewing  the  cases  where  all  nodes  are  stored  and  do  not  have  any  indexes  defined  on 
any  of  the  attributes.  Since  the  use  of  indexes  and  virtual  nodes  does  not  have  any 
influence  on  the  remainder  of  this  work,  there  will  be  no  particular  need  in  covering 
extensively  all  the  cases. 


Notation: 

CPUweight  The  relative  weight  of  the  CPU  utilization. 

I/Oweight  The  relative  weight  of  the  I/O  operations. 

N  A  node  in  the  discrimination  network  (a,  P  or  Pnode). 

Fi(N)  The  insert  frequency  in  node  N. 

Fd(N)  The  delete  frequency  in  node  N. 

Card(N)  The  cardinality  of  node  N. 

Ci(N)  Insertion  cost  in  node  N. 

Cd(N)  Deletion  cost  in  node  N. 

Sel(A)  Selectivity  of  the  predicate/condition  A. 

JSF(A,B)  Join  selectivity  factor  between  nodes  A  and  B. 
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The  network  below  will  be  used  as  a  reference  example  that  we  will  refer  to  throughout 
the  description  of  the  cost  formulas  in  the  remaining  sections  of  this  chapter. 

Oi      True      (52    Tme    O3  True 


3.2.1  Cost  Functions  for  a  Nodes: 

The  cardinality  of  an  alpha  node  is  estimated  as  the  product  of  the  cardinality  of 
the  relation  on  which  the  node  is  defined  and  the  selectivity  Sel(a)  of  the  selection 
predicate  defined  over  the  same  relation.  If  there  is  no  selection  defined  on  the  relation,  <y 
is  considered  to  be  the  logical  "TRUE"  predicate  and  Sel(a)  is  taken  to  be  1. 
The  cost  of  an  alpha  node  is  defined  by 

Cost  (a)  =  Ci(a)  +  Cd(a) 
Note:  As  indicated  at  the  beginning  of  this  section,  we  will  only  consider  here  the  case  of 
stored  alpha  nodes  with  no  index  defined  on  any  of  the  attributes.  (The  use  of  indexes  is 
not  crucial  in  the  context  of  this  work.  The  reader  will  find  an  elaboration  on  this  aspect 
in  chapter  5). 


Figure  3-4  Example  of  a  Gator  network 
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Cost  of  insert  in  the  alpha  node: 

The  insertion  of  a  new  tuple  in  a  relation  requires  one  disk  read  and  one  disk 
write,  therefore  two  I/Os  do  occur.  Since  only  one  tuple  is  inserted,  CPUweight  is 
incurred  once. 

Ci(a)  =  (CPUweight  +  2  *  I/Oweight)  *  Fi(a) 

Cost  of  delete  from  alpha  nodes: 

In  order  to  delete  a  tuple  from  an  alpha  node,  all  the  pages  of  the  alpha  node  are 
read  once  and  tuples  in  those  pages  are  scanned  in  order  to  identify  the  one  to  delete. 
Then  the  page  where  the  tuple  was  rewritten  is  written  back  to  disk. 

Cd(a)  =  [CPUweight  *  Card(a)  +  I/Oweight  *  (Page(a)  +  1)]  *  Fd(a) 

3.2.2  Cost  of  Generating  Temporary  Join  Results,  (TR): 

When  a  token  is  inserted  in  a  node  with  multiple  siblings,  a  sequence  of  two-way 
joins  is  executed  until  the  complete  join  is  computed.  The  order  in  which  the  joins  are 
executed  is  predefined  in  the  "join  plan"  of  each  node.  The  join  plan  is  computed  when 
the  network  is  built,  as  the  sequence  generating  the  smallest  possible  intermediate  join 
results:  TRn's.  Li  the  example  network  above,  an  insertion  in  a2  would  join  with  either  tti 
or  a3  (depending  on  the  join  plan  of  ai),  then  the  resulting  set  of  compound  tokens  would 
be  joined  to  the  remaining  sibling  in  the  join  plan.  The  formulas  below  estimate  the  cost 
of  generating  these  sequences  of  temporary  joins  for  both  alpha  and  beta  nodes. 
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Stored  Alpha  nodes  -  No  index  on  join  attribute: 

Assuming  the  system  has  enough  memory,  the  cost  will  consist  of  reading  the 
alpha  node  from  disk  and  performing  the  join  of  TRn  with  the  alpha  node: 

Cj(TRn,  a)  =  I/Oweight  *  Pages(a)  +  CPUweight  *  Card(a)  *  Card(TRn) 

With  TRo=  1, 

Card  (TRn)=  Card  (TRn-i)  *  JSF(Nn.i  ^  N„). 
Note:  Nn-i  ->  Nn  represents  the  two  nodes  in  sequence  in  the  join  plan  of  the  alpha  node 
where  the  token  was  initially  inserted. 

Beta  nodes  -  No  join  attribute  index: 

The  cost  of  joining  from  a  TRn  to  a  beta  node  (without  indexes)  is  similar  to  that 
of  joining  with  an  alpha  node.  The  cost  formula  is  the  same  as  the  previous  one,  where 
alphas  are  replaced  with  beta  nodes. 

Cj(TR„,  P)  =  FOweight  *  Pages(p)  +  CPUweight  *  Card(p)  *  TR„ 
With  TRo  =  1 

Card(TRn)=  Card(TR„.,)  *  JSF(Nn-i  ^  Nn) 

Note:  Nn-i  Nn  represents  the  two  nodes  (alpha  or  beta)  in  sequence  in  the  join  plan  of 
the  beta  node  where  the  token  was  initially  inserted. 

3.2.3  Cost  of  Beta  Nodes: 

The  cost  of  a  beta  node  consists  of  two  components: 
A  local  cost  noted  LocalCost. 
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-    The  cost  for  children  nodes. 
Cost(  p)  =  LocalCost(  p)  +     ^  Cost(N) 

Nechildren(P) 

LocalCost( P)  =     2 {Fi(N)  *  InsertCost (N,  p)  +  Fd(N)  *  DeleteCost  (N, p)} 

N€children(P) 

The  LocalCost  represents  the  cost  incurred  to  update  the  beta  node  itself.  This  cost  has 
two  components:  insertion  and  deletion  costs  related  to  the  beta  node.  Each  of  these  two 
components  is  computed  incrementally  with  respect  to  each  of  the  children  of  the  beta 
node.  Each  elementary  contribution  relative  to  the  current  child,  is  considered  relatively 
to  the  "weight"  of  that  child,  which  is  represented  by  the  corresponding  insertion  and 
deletion  frequencies  Fi  and  Fj.  Below  is  a  high-level  pseudo  code  description  of  the 
InsertCost  and  DeleteCost  procedures. 

Considering  the  current  child  N  of  node  P,  the  following  procedure  gives  the  participation 
of  N  in  the  insertion  cost  of  p. 
InsertCost  (N.  p): 
3egin 

TRsize  =  1 
TempCost  =  0 

For  each  node  n  in  the  join  order  plan  of  N 

TempCost  =  TempCost  +  Cj(TRsize,n) 

TRsize  =  TRsize  *  Sel(ThisNode  ^NextlnPlan) 

EndFor 

updateCost  =  [  ]  *  2  *  I/Owdght 

tuplesPerPage{/3) 

return  (TempCost  +  updateCost) 

End 

Similarly  to  insertion  costs,  deletion  costs  are  given  by  the  following  procedure  for  the 
current  child  N  of  a  node  p.  The  Yao(Pages(p),  TRsize)  function  appearing  in  the 
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procedure  DeleteCost  below  simply  estimates  the  number  of  pages  of  (3  that  will  be 
touched  when  TRsize  tokens  are  (randomly)  selected  from  the  node.  The  exact  definition 
of  the  function  Yao  as  well  as  estimates  of  it  is  given  by  Bernstein  et  al  [BerSl]. 
DeleteCost  (N.  (3): 
Begin 

TRsize  =  1 

For  each  node  n  in  join  order  plan  of  N 

TempCost  =  TempCost  +  Cj(TRsize,n) 

TRsize  =  TRsize  *  Sel(ThisNode  ->NextInPlan) 
EndFor 

updateCost  =  [Yao(Pages(p),  TRsize)  +  Pages(p)]  *  I/Owdght  + 
Trsize  *  Card(p)  *  CPUweight 
Return  (updateCost) 

End 

As  mentioned  earlier  in  this  chapter,  a  more  detailed  description  of  the  cost 
formulas  and  procedures  can  be  found  in  Hanson's  papers  on  the  optimization  of  Gator 
networks  [Han96b,  Han97b].  The  reduced  version  outlined  in  the  previous  pages  is 
however  sufficient  for  the  purposes  of  this  work.  In  the  subsequent  chapter,  we  will 
introduce  the  work  accomplished  for  selection  predicate  placement  in  query  optimization 
and  relate  it  to  the  selection  predicate  placement  problem  in  discrimination  networks. 


41 


CHAPTER  4 

RELATED  WORK:  INADEQUACY  OF  THE  RANK  METRIC. 

Several  research  projects  and  related  articles  have  addressed  the  issue  of  optimal 
selection  predicate  placement  among  joins  in  a  query  tree.  The  LDL  algorithm  was  first 
proposed  in  [Chi89]  then  later  adapted  for  use  with  object-relational  models  in  [Yaj91]. 
In  this  approach,  the  selection  predicate  is  passed  to  the  query  optimizer  as  a  relation  and 
treated  as  such  during  the  whole  optimization  phase.  Many  of  the  subsequent  algorithms 
use  the  notion  of  a  rank  that  is  assigned  to  each  of  the  predicates  (joins  and  selections)  in 
the  query  tree.  The  notion  of  a  rank  metric  was  explored  in  operations  research  [Mon79], 
as  well  as  database  issues  [Cha96,  Kri86].  After  assigning  the  ranks,  the  algorithms 

iterate  over  each  "leaf  to  root"  path  of  the  query  tree  and  reorganize  the  current  stream  in 

,        .       ■        .    r^,        ,        .  .    ,  ^    ,     .         ,     Selectivity  -  I 

ascendmg  order  of  the  rank  metric.  The  rank  metric  is  defined  to  be:  rank  =  

cost 

Without  a  formal  proof,  it  is  intuitively  easy  to  see  that  a  low  rank  will  be 
produced  by  both  a  small  selectivity  factor  (which  represents  a  very  selective  operator) 
and  cost  (i.e.  a  cheap  operator).  Non-selective  and  expensive  predicates  yield  higher 
ranks,  and  are  therefore  delayed  as  much  as  possible,  while  cheap  and  selective  predicates 
are  applied  earlier.  The  ordering  of  selections  and  joins  based  on  the  rank  metric  yields 
satisfying  plans  in  query  processing  [Cha96,  Hel92,  Hel93,  Hel95].  Yet,  it  appears  to  fail 
to  produce  consistent  results  when  applied  to  the  placement  of  selection  predicates  among 
joins  in  a  discrimination  network  structure.  Because  it  does  not  incorporate  important 
parameters  such  as  insert  and  delete  frequencies,  it  seems  that  the  rank  metric  is  not 
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"rich"  enough  to  capture  with  accuracy  the  cost  model  developed  for  Gator  networks.  As 
a  matter  of  fact,  it  is  fairly  easy  to  infer  from  the  cost  formulas  presented  earlier,  that 
different  distributions  of  token  frequencies  can  sensibly  change  the  cost  of  a 
discrimination  network,  and  thus  the  decision  of  moving  a  selection  predicate  or  not. 
Consequently,  the  results  given  by  a  rank  based  algorithm  do  not  always  represent  the 
correct  ordering  of  predicates  in  the  discrimination  network  tree.  Before  considering  a 
simple  example,  it  is  necessary  to  introduce  at  this  point  a  minor  extension  to  the 
proposed  Gator  cost  model,  as  well  as  the  basic  modeling  approach  assumptions  that  will 
be  used  throughout  this  chapter.  Later  in  the  chapter,  the  reader  will  find  a  short 
mathematical  analysis  of  the  same  example,  showing  how  various  distributions  of  token 
frequencies  can  impact  the  decision  surface  for  moving  selection  predicates. 

4.1   Extensions  to  The  Cost  Model  and  Assumptions: 

The  main  extension  to  the  Gator  networks  cost  model  will  be  to  define  a 
selectivity  or  "filtering  capacity"  for  edges  joining  a  child  to  its  parent  node.  We  will  refer 
to  this  property  as  the  "Edge  Selectivity  Factor"  and  denote  it  ESP. 
4. 1 . 1  The  Edge  Selectivity  Factor: 

hi  most  relational  systems  and  since  the  introduction  of  System  R  [Ast76],  joins 
have  been  heavily  modeled  with  a  Join  Selectivity  Factor  (JSF)  value  defined  as  the  ratio 
of  the  cardinality  of  the  result  of  the  join,  over  the  products  of  the  cardinalities  of  the 
base  relations  (or  nodes).  As  a  result,  the  value  of  the  JSF  is  a  rational  number  lying  in 
the  interval  [0,1].  Practical  values  for  the  JSF  fall  within  the  interval  (0,1)  since  the 
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values  0  and  1  represent  extreme  cases  where  the  join  result  is  either  empty  or  equal  to 
the  Cartesian  product  of  the  base  relations  (nodes).  Consider  the  following  join: 

JSF:  1/1000 


The  figure  above  gives  an  illustration  of  the  JSF  values  as  they  were  just  described.  The 
cardinalities  of  R,  S  and  J  are  respectively  1000,  100  and  100.  The  value  of  the  JSF  is 
taken  as  the  ratio  of  card(J)  divided  by  card(R)  *  card(S)  which  is  in  this  case  100  /  (100  x 
1000)  =  1/1000.  While  this  value  models  the  join  from  a  global  point  of  view,  it  does  not 
correctly  depict  the  participation  of  "each  side"  (R  and  S  here)  in  the  joining  operation  in 
terms  of  numbers  (or  percentages)  of  tokens  flowing  across  the  join  edges.  As  a  matter  of 
fact,  the  cardinality  of  node  J  is  not  one  thousandth  of  either  that  of  R  or  S.  Instead,  each 
token  or  tuple  from  S  has  a  match  in  J  (on  average),  while  only  about  one  tenth  of  the 
tuples  from  R  have  a  match  (on  average)  in  J.  Since  the  selection  predicate  is  always 
located  on  one  of  the  relations  participating  in  the  join  operator,  it  is  important  to 
correctly  model  the  selectivity  of  the  join  with  respect  to  each  of  the  relations 
participating  in  it.  The  Gator  network  cost  model  uses  the  notion  of  a  "directional" 
selectivity  factor  for  joins,  based  on  the  distribution  of  the  join  attribute  in  each  relation. 

Consider  a  join  condition  (from  the  figure  above)  of  the  form  R.x  =  S.x  where  "x" 
is  the  joining  attribute.  Under  the  uniform  distribution  assumption  for  attribute  x  as  it  is 
defined  by  Selinger  et  al  [Sel79],  the  selectivity  of  R^S  is  taken  as  the  ratio:  Sel(R->S)  = 
Caid(S)/Val[x/S] ,  where  Val[x/S]  represents  the  number  of  distinct  values  of  x  in  relation 


R  '  - ' 


Cardinality:  100 


J 


Cardinality:  100 
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S.  Symmetrically,  the  selectivity  S^R  is  given  by  Sel(S^R)  =  Caiti(R)  /  Val[x  /  R].  Surely, 
this  approach  gives  correct  results  under  the  specified  assumptions  of  uniformity. 
However,  it  does  suffer  from  a  number  of  limitations,  which  are  listed  below: 

-  The  hypothesis  that  a  token  joins  on  average  with  Caid(Relatic»i)/Val[x/ Relation]  tokens 
is  valid  for  equijoins.  But  when  considering  non-equijoins  such  as  R.x  >  S.x  or  R.x 
overlaps  S.x  (if  x  is  a  polygon  attribute),  this  approach  becomes  inadequate. 

-  When  considering  join  conditions  such  as  R.x  overlaps  S.x  -rather  common  in  the 
extensible  environment  for  which  TriggerMan  was  designed-  the  notion  of  distinct 
values  for  an  attribute  (such  as  a  polygon)  may  be  difficult  to  define  [Haa95]. 

Since  the  number  of  distinct  values  of  an  attribute  cannot  exceed  the  cardinality  of 

the  relation  (i.e.  the  attribute  is  a  key,  and  all  values  are  distinct),  the  ratio 

Card(Relation)  /  Val[x  /  Relation]  never  drops  below  1.  On  the  other  hand,  the 

selectivity  of  a  selection  predicate  is  always  in  the  interval  (0,1).  Recalling  the 

definition  of  the  rank=  ^^^^^^^^^^y  ~  ^  ^  ranks  of  joins  will  therefore  be  positive 

cost 

numbers  while  those  of  selection  predicates  will  remain  strictly  negative,  no  matter 
what  the  costs  of  the  predicates  are.  Consequently,  the  optimization  strategy  becomes 
useless  because  the  ordering  of  predicates  always  puts  selections  before  joins,  since 
the  ranks  of  joins  are  always  strictly  larger  than  those  of  the  selection  predicates. 
In  spite  of  the  fact  that  selection  predicate  placement  in  discrimination  networks 
will  not  be  directly  reusing  ranks,  the  general  idea  behind  the  heuristic  is  also  based  on 
the  possible  "good"  selectivity  of  joins,  allowing  a  reduction  in  the  number  of  times  an 
expensive  sigma  has  to  be  done.  If  the  cost  reduction  incurred  by  doing  the  (expensive) 
selection  predicate  "less  often"  surpasses  the  increase  incurred  by  doing  the  joins  more 
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often,  the  strategy  works  and  the  selection  predicate  is  replaced  in  the  Gator  network. 
However,  If  the  selectivity  factor  of  a  join  is  always  greater  or  equal  to  1 ,  the  "traffic" 
(percentage  of  tokens  flowing  along  an  edge)  is  never  reduced  after  that  join  operation  is 
executed,  and  the  optimization  fails  in  all  cases.  Therefore,  the  Gator  network  model 
needs  to  be  extended  to  represent  joins  that  can  have  selectivity  factors  in  the  [0,1] 
interval,  which  are  the  joins  of  "interest"  as  far  as  the  selection  predicate  placement  is 
concerned.  Going  back  to  the  previous  example  with  relations  R  and  S  joining  into  J,  the 
join  operation  resulting  in  the  J  node  can  be  viewed  in  two  symmetrical  ways: 
-    Start  off  with  node  R  empty  and  node  S  full,  and  gradually  insert  tokens  in  R  until  it 
gets  full.  At  that  point,  the  cardinality  of  the  node  J  will  have  reached  100  and  that  of 
R  will  be  1000.  On  average,  each  token  inserted  in  R  will  have  contributed  with  0.1 
tokens  on  the  edge  R^J,  and  one  out  of  every  ten  tokens  inserted  in  R  will  have 
joined  if  each  joining  token  finds  a  single  match. 

Start  off  with  node  R  full  and  node  S  empty,  then  gradually  fill  up  S  until  it  reaches 
its  cardinality  of  100  tokens.  Since  the  join  operation  is  independent  of  the  joining 
order  (R  join  S  =  S  join  R),  the  resulting  joining  node  is  exactly  identical  to  the  one 
obtained  in  the  first  approach.  In  this  case,  each  token  inserted  in  S  will  have 
contributed,  on  average,  with  one  token  on  the  edge  S->J.  (see  figure  below). 


JSF:  1/1000 


Cardinality:  1000  R 


Cardinality:  100 


Edge  Selectivity:  0.1 


Cardinality:  100 


Edge  Selectivity:  1 
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The  selectivity  of  joins  will  therefore  be  defined  with  respect  to  each  input  stream 
to  the  join.  The  value  of  that  selectivity  represents  the  "filtering"  capacity  of  the  join  with 
respect  to  the  corresponding  base  node;  i.e.  what  proportion  of  tokens  is  generated  along 
that  edge  for  each  token  coming  into  the  base  node.  The  value  0.1  means  as  described 
above,  that  for  each  token  inserted  in  R  (whether  it  does  join  or  not),  in  average  0.1 
tokens  will  be  generated  down  the  edge  R->J.  In  other  words,  the  edge  R-^J  divides  the 
traffic  coming  into  R  by  a  factor  of  ten. 
Remarks: 

-  The  Edges  Selectivity  Factors  (referred  to  as  ESFs)  will  be  represented  as  shown 
above,  directly  on  the  edge. 

-  ESF(R-^J)  is  computed  as  JSF(R<-^S)  *  Card(S)  and  symmetrically  ESF(S^J)  is 
computed  as  JSF(R<-4S)  *  Card(R).  Since  Card(J)  is  defined  as  JSF(R<-^S)  *  Card(R) 
*  Card(S),  it  follows  that  ESF(R-^J)  =  Card(J)  /  Card(R)  and  ESF(S^J)  =  Card(J)  / 
Card(S) 

The  ESF  is  not  confined  to  the  interval  [0,1]  (or  (0,1) )  as  JSF  values  are. 

4. 1 .2  Predicates  Selectivities  and  Costs: 

The  standard  assumptions  of  query  optimizers  about  the  accuracy  of  database 
statistics  (estimates  of  parameters  values)  and  the  independence  of  the  selectivity  factors 
of  predicates  will  be  made  in  the  remainder  of  this  work. 
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The  results  on  the  existence  of  a  constant  differential  (per-tuple)  cost  for  all  the 
well-known  join  algorithms  [Hel92,  Hel95]  will  be  reused  in  this  work  when  describing  a 
rank  based  operation: 

The  partial  differential  costs  of  joins  with  respect  to  their  base  nodes  are  constants  for  all 
the  well-known  join  algorithms,  and  hence  the  cost  of  a  join  per  tuple  of  each  input  is 
typically  well  defined  and  independent  of  the  cardinality  of  either  input. 

4.2   Allee-Gator  Example: 

Given  the  definition  of  the  edge  selectivity  factor,  let  us  consider  a  practical 

example.  Doctor  Guy  Thorfan  from  the  wildlife  department  of  the  University  of  Florida 

is  currently  working  on  a  research  project  involving  fauna  observation  in  multiple 

locations  around  the  state  of  Florida.  One  of  his  concerns  is  the  presence  of  GATORS  in 

the  lake  AUee  area  in  Alachua  County.  Data  is  gathered  as  follows:  graduate  students 

using  laptops  and  state  of  the  art  GPS  technology,  continually  observe  the  areas  of 

interest  and  type  in  their  "sightings"  that  are  immediately  sent  back  to  the  central 

database.  In  order  to  spot  the  presence  of  GATORS  around  Lake  Allee,  Dr  Thorfan  writes 

the  following  trigger  against  the  database  of  the  department: 

Create  Trigger  Allee-Gator 

From  Fauna_Sightings,  InterestArea 

Where  hiterestArea.Box  contains  Fauna_Sigthings.Location 
And  Fauna_Sightings.Animal  =  "Gator" 
And  InterestArea.User  =  "Thorfan" 

The  Fauna_Sightings  table  is  defined  as  Fauna_Sightings(ID,  Animal 
varchar(255),  Location  point),  and  a  listing  of  its  data  is  given  in  Appendix  A.  The  Lake 
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Allee  area  of  interest  is  supposed  to  be  geographically  represented  with  the  box  delimited 
with  the  comer  points  (0,0)  and  (1,1)  in  a  predetermined  system  of  coordinates  and  we 
will  suppose,  for  simplicity  reasons  but  without  loss  of  generality,  that  the  selection  "User 
=  Thorfan"  returns  only  that  box.  The  InterestArea  table  is  also  shown  in  Appendix  A. 
All  nodes  in  the  network  are  considered  to  be  stored  without  any  indexes  defined  on  any 
of  the  attributes.  Also,  the  database  will  be  supposed  to  be  insert-only  so  that  delete 
frequencies  can  be  ignored  in  the  context  of  this  example. 
The  discrimination  network  representing  this  rule  is  described  below: 

Cost:  10 

/ 

Selectivity:  0.6  <7l  Animal  =  "Gator-  ^2  Usei="Thorfan" 

tti. Box  contains  tti. location 

Cardinality:  12  ai-^-^'  2o  C2:50'^2   Cardinality:  1 

^^/12\^^^^/'^^ 

P   Cardinality:  1 

Node  tti  is  based  on  the  relation  Fauna_Sightings  and  has  a  selection  predicate  (5\ 
"Animal  =  Gator"  defined  on  it.  The  selectivity  of  <3\  is  represented  by  the  number  of 
tuples  where  "Animal  =  Gator"  (displayed  with  a  gray  shaded  background  in  Appendix 
A)  over  the  cardinality  of  Fauna_Sightings:  12/20  =  0.6.  Also,  the  selection  is 
supposed  to  return  one  unique  row  from  table  InterestArea  with  a  Box  value  of  (0,0,1,1). 
The  selectivity  of  C2  is  not  critical  in  the  context  of  this  example;  therefore  we  will  give  it 
an  arbitrary  value  of  0.1.  The  selection  predicate  G\  and  the  join  operations  between  tti 
and  tti  have  been  assigned  differential  cost  values  indicating  the  ratio  of  the  complexity 
of  one  operation  with  respect  to  the  other  in  some  predefined  unit. 
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The  rank  of  a\  is  defined  as:   rank(ai)  = 

Therefore  rank  (Oi)  =  Mzii  ^  _o.04 

10 


Sel(ai)-1 
Cost(ai) 


The  result  of  the  join  of  ai  and  ai  is  a  one-row  table: 


Animal 

Location 

User 

Box 

Gator 

(1,1) 

Thorfan 

(0,0,1,1) 

And  the  selectivity  of  the  join  with  respect  to  tti  is  then  1/12  =  0.083. 

Therefore,  rank(Join/ai)  =  Q  Q^^  ~  ^  =  -0.0458 

20 

Representing  both  ranks  on  a  linear  axis,  we  can  see  that  the  rank  of  the  join  is  strictly 
smaller  than  that  of  the  selection  predicate. 


Join/a  1  Gi 
 1  h- 


-0.046   -0.04  0  ranks  axis 

And  according  to  the  rank  ordering  of  predicates,  the  selection  predicate  C\  should  be 
pushed  down  to  produce: 


True 


a,  - 
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Let  us  now  reconsider  the  same  network  with  insert  frequencies  assigned  to  each 
input  edge  and  checlc  the  results  given  by  the  discrimination  network  cost  formulas: 


If  C  is  the  initial  cost  of  the  network,  let  AC^  and  AC"  represent  respectively  the  increase 
and  decrease  components  that  occur  in  C  when  the  G\  operator  is  pushed  down. 
The  AC  part  of  the  variation  of  C  will  have  three  components: 

-  The  increase  in  the  frequency  of  execution  of  the  join  operation  between  tti  and 
for  each  token  inserted  in  tti. 

-  The  increase  in  the  frequency  of  insert  operations  applied  to  node  ai.  That  value  will 
be  represented  as  AT. 

-  The  increase  in  the  differential  cost  C2  of  joining  tokens  from      to  ai  (since  tti  is 
larger)  noted  AC2.  (AC2  >  0). 

AC~  will  be  the  decrease  in  the  frequency  of  execution  of  Gi  (if  any). 

When  Oi  is  pushed  down,  some  of  the  parameters  of  the  network  do  change,  as  follows: 

-  The  cardinality  of  a\  grows  from  12  to  20  tuples  (or  tokens). 

-  The  resulting  join  of  tti  with  tti  becomes  the  following  table 
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Animal 

Location 

User 

Box 

Gator 

(1,1) 

Thorfan 

(0,0,1,1) 

Crow 

(1,1) 

Thorfan 

(0,0,1,1) 

Frog 

(1,1) 

Thorfan 

(0,0,1,1) 

And  the  join  selectivity  factor  becomes  3/(20  *  1)  =  0.15.  The  selectivity  of  edge  (tti^P) 
becomes  0.15  *  1  =  0.15  and  that  of  edge(a2^p)  becomes  0.15  *  20  =  3. 


P 

The  values  for  the  increase  and  decrease  in  the  global  cost  of  the  network  are  then: 
AC  :  (0.8  -  0.48)  *  20  +  AT  +  0.02  *  AC2  =  0.32  *  20  +  Af  +  0.02  *  AC2 

=  6.4  +  AT  +  0.02  *  AC2 
AC":  [0.8 -(0.8*  0.15 +  0.02*  3)]  *  10  =  (0.8  -  0.18)  *  10  =  0.62*  10  =  6.2 
Since  AC*  is  greater  than  AC  (AT  and  AC2  are  strictly  positive  terms),  the  operation  of 
pushing  the  selection  predicate  down  increases  the  cost  of  the  network  and  should  not  be 
done.  The  80/20  distribution  of  tokens  is  therefore  not  favorable  to  a  pushdown  of  the 
selection  predicate  ai. 
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Consider  now  the  same  network  with  a  different  distribution  of  insert  frequencies: 


P  I 

P 

In  this  case  we  have  voluntarily  biased  most  of  the  insert  traffic  towards  the  edge 
containing  the  selection  predicate  G\.  The  pushing  of  C\  transforms  the  network  as 
described  in  the  figure  above.  The  values  of  AC*  and  AC"  are  given  below: 
AC^  :  (0.99  -  0.594)  *  20  +  AT  +  0.001  *  AC2  =  0.396  *  20  +  AT  +  0.001  *  AC2 

=  7.92  +  AT  +  0.001  *  AC2 
AC:  [0.99 -(0.99*  0.15 +  0.001  *  3)  ]  *  10  =  (0.99  -  0.1515)  *  10  =  0.8385  *  10  =  8.38 
Clearly,  AC*  could  be  smaller  than  AC  in  this  case.  If  At  and  AC2  are  small  enough 
and/or  can  be  neglected,  then  the  selection  predicate  pushdown  operation  improves  the 
cost  of  the  network  and  should  be  executed. 

This  example  pointed  out  at  how  the  distribution  of  insert  frequencies  alone  can 
affect  the  decision  surface  regarding  the  proper  placement  of  selection  predicates  in 
discrimination  networks.  In  the  following  section,  the  reader  will  find  a  generalization  of 
this  example  and  a  short  analysis  of  the  behavior  of  cost  formula-based  decision  surfaces 
with  respect  to  distributions  of  tokens  frequencies. 
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4.3   Generalization  of  the  Cost  Considerations: 


Fi  F2 


Pnode 

Figure  4-1  Analysis  of  selection  predicate  pushdown 

Let  us  consider  in  this  analysis  (without  loss  of  generality)  an  insert-only 
discrimination  network,  with  two  alpha  nodes  joining  into  a  Pnode  as  described  on  the 
left  side  of  Figure  4-1  above.  It  will  be  considered  in  the  context  of  this  study,  that  there 
is  no  index  defined  on  any  attribute,  as  adding  indexes  will  increase  the  complexity  of  the 
cost  formulas  without  having  any  impact  on  the  results  of  the  analysis.  (The  reader  will 
find  an  elaboration  on  this  aspect  in  Chapter  5,  section  5.2.3). 

Fi  and  F2  are  the  frequencies  of  tokens  arriving  respectively  into  the  left  and  right 
input  edges  of  the  network.  The  selectivities  of  selection  predicates  a\  and  C2  are  denoted 
Sa,  and  802.  The  edge  selectivity  factors  of  edges  (ai^p)  and  (tta^P)  are  represented  by 
Sj,  and  Sj2.  (Recall  that  Sj,  and  Sj2  are  defined  as  JSF12  *  Card(a2)  and  JSF12  *  Card(ai) 
respectively).  Finally,  ai  has  been  assigned  a  cost  of  C0,  and  the  differential  costs  of  the 
join  between  tti  and  a2  are  denoted  Q,  and  Q2. 
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By  definition  of  the  rank  metric,  the  rank  of  a\  is:  rank(ai)  = 


Coi 


Si  —  1 

And  the  rank  of  the  join  with  respect  to  ai  is  defined  as:  rank(J/ ai)  =  — 

The  rank  based  heuristic  orders  the  predicates  in  ascending  order  of  the  rank 
metric,  and  will  therefore  decide  to  perform  a  PUSHDOWN  when  the  rank  of  the  join  is 
smaller  than  the  rank  of  the  selection  predicate  Gi:  rank(J/ai)  <  rank(ai) 
Replacing  the  expressions  of  rank(J/ai)  and  rank(ai)  with  their  respective  values  as  they 

were  defined  above,  we  can  rewrite: 

rank(J/  ai)  <  rank(  ai)  <=>        ^  <  — — - 

Cji  Coi 

Then  by  multiplying  each  side  by  Cji  and  dividing  by  (Soi-  1),  we  get 

C 1     Si  —  1 

<t=>       <  —        (reversing  the  comparison  since  Soi  - 1  <  0) 

Col      Sol  —  1 

Which  can  be  rewritten  as: 

o  Cji  <^!^*  Coi  (1) 

l-Soi 

The  rank  heuristic  will  therefore  perform  a  pushdown  when  the  condition  (1)  above  is 

1  -  S'l 

true  and  not  perform  the  pushdown  when  Cji  >          *  Coi 

1-Soi 

Let  us  now  consider  the  relocation  of  the  selection  predicate  ai  above  the  Pnode 
from  the  aspect  of  the  Gator  network  cost  formula.  After  pushing  down  the  selection 
predicate  Oi,  we  get  the  discrimination  network  described  on  the  right  side  of  Figure  4-1. 
Since  the  values  of  JSF12  and  So,  remain  unchanged  under  the  assumption  of 
independence  of  the  selectivity  of  predicates,  the  edges  selectivity  factors  values  become 
Sj,  and  Sj2'  (Sj,  =  JSF12  *  Card(a2),  which  is  unchanged,  and  Sj,'  =  JSF12  *  Card(ai), 
which  is  changed).  The  node  ai  now  receives  an  input  stream  of  frequency  Fi  instead  of 
Fi*  So,.  The  frequency  of  tokens  now  going  through  ai  is  given  by  (Fi  *  Sj,  +  F2  *  S02  * 
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Sj,')  where  Fi  *  Sj,  is  the  contribution  from  the  left  side  of  the  join,  and  Ft  *  So:  *  Sj/  the 
contribution  from  the  right  side. 

Reusing  the  same  notations  to  describe  the  variations  in  the  cost  of  the  Gator 
network,  let  AC"^  and  AC"  represent  respectively  the  increase  and  decrease  components  in 
the  cost  C  of  the  network  when  Oi  is  pushed  down  as  shown  in  Figure  4-1. 
AC"^  has  three  components: 

The  increase  in  the  number  of  times  the  join  from  ai  to  a2  is  executed. 
The  increase  in  the  number  of  insertions  into  node  ai . 
-    The  increase  in  the  cost  of  joining  tokens  from     to  ai :  ACj2  (positive  quantity). 
The  value  of  AC"*^  is  therefore: 

AC" :  (F,  -  Fi  *  Sa.)  *  Cj,  +  (F,  -  Fi  *  So,)  *  (CPU^eigh,  +  2  *  I/Owdght)  +  F2  *  So,  *  AQ, 
Since  (CPUweight  +  2  *  I/Oweight)  is  the  insertion  cost  into  an  alpha  node  without  indexes. 
(The  reader  may  refer  to  chapter  3  and  [Han92]  for  more  details  on  the  cost  formulas). 
Replacing  the  quantity  (CPUweight  +  2  *  I/Oweight)  by  "Ic"  to  simplify  our  notations,  we  can 
rewrite  AC"^  as  follows: 

AC" :  =  Fi  *  (1  -  Sa,)  *  [Q,  +  (CPUwdgh.  +  2  *  I/Oweight)]  +  F2  *  Sa,  *  AQ,  = 

=  Fi  *  (1  -  Sa,)  *  [Cj,  +  Ic]  +  F2  *  So:  *  ACj2 
Since  the  selectivity  factor  Sa,  is  a  value  strictly  smaller  than  1,  it  follows  that  Fi  *  Sa,  is 
also  strictly  smaller  than  Fi,  and  therefore  AC^  is  always  a  positive  number.  (AC*  >  0). 
The  decrease  in  the  cost  of  the  network  C  lies  in  the  (possible)  reduction  of  the  frequency 
of  execution  of  the  operator  C\ : 

AC"  =  [Fi  -  (Fi  *  Sj,  +  F2  *  Sa,  *  Sj,')]  *  Ca, 
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In  order  to  analyze  the  influence  of  the  distribution  of  token  frequencies  over  the 
pushdown  heuristic,  let  us  consider  that  Fi  =  m  *  F2  with  me  Q"^*  (Q^*  being  the  set  of 
positive,  non-null  rational  numbers). 
Replacing  F2  by  F|/m  we  can  rewrite  AC~ 

o  AC"  =  [Fi  -  (Fi  *  Sj,  +  ^  Sc2  *  Sj2')]  *  Ca, 

m 

Then  by  factoring  out  Fi  and  using  m  as  a  common  denominator 

o  AC"  =Fi*(l-Sj,-  *^''VCa. 

m 

^AC"=F.*("^"^^^'*^"-S.)*Co. 
m 

The  PUSHDOWN  operation  will  be  beneficial  when  the  decrease  in  the  network  cost  C  is 

strictly  greater  than  the  increase:  0  <  AC"^  <  AC". 

After  replacing  AC"^  and  AC"  by  their  respective  values,  we  can  rewrite: 

O  Fi  *  ( 1  -  Sa,)  *  [Q.  +  Ic]  +  —  *  S02*  ACj2  <  Fi*("""^^''*^°'  -  Sj, )  *  Cm 

m  m 

And  after  simplifying  by  Fi  (non-null)  on  both  sides  of  the  inequality: 

«  (1  -  S.)  *  [Q,  +  Ic]  +  -^^i^  <  ^}IL^>L^ .  s. )  *  Co, 

m  m 

Rearranging  terms  and  dividing  each  side  of  the  inequality  by  (I-S0,),  we  obtain 

m-Sj2'*So2  ^ 

oQ,<  21  !!*Ca,--^^i^^  -Ic  (2) 

l-So,  m(l-SaO 

Therefore,  the  selection  predicate  Ci  has  to  be  pushed  down  when  the  inequality  (2) 

above  is  true  and  will  not  be  pushed  down  when  the  same  inequality  is  reversed.  To 

So^  *  AC  2  O 

simplify  the  notations,  we  will  rewrite  the  quantity  —  —  as  —   (Q  >  0). 

m(l-So,)  m 
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Table  1  Conditions  for  pushing  down  a  selection  predicate 


PUSHDOWN 


NO  PUSHDOWN 


Rank 
Based 


1-Sc, 


Q,>i:i^*Ca, 

l-Sa, 


Cost  formula 
Based 


m-Sj2'*Sa2 


-Sj, 


m  -Sj2'*So2 


Cj,< 


m 


1-Sa, 


*Ca,-  — -Ic 


m 


m 


l-Sm 


*Ca,-  — -Ic 


m 


Table  1  above  recapitulates  the  results  of  this  analysis  so  far.  Setting  the  insertion 
cost  Ic  aside  temporarily,  one  can  notice  that  the  limit  of  the  cost  formula  based  criterion 
(when  m  is  very  large:  m^<»)  becomes  the  rank  criterion.  While  a  more  precise  analysis 

is  required  to  reach  meaningful  conclusions,  this  remark  gives  an  initial  picture  on  the 
following  issues: 

-  What  is  the  influence  of  the  distribution  of  token  frequencies  (recall  that  different 
values  of  m  represent  different  distributions  of  token  frequencies)  over  the  decision 
for  pushing  down  a  selection  predicate  using  the  cost  formula  based  approach? 

-  How  do  the  rank  and  cost  formula  based  approaches  relate  to  each  other,  and  how 
can  different  token  distributions  affect  this  relationship? 

In  order  to  gain  a  better  understanding  of  the  two  points  raised  above,  it  is 
necessary  to  proceed  at  this  point  of  the  analysis  with  a  study  of  the  function  f(m)  that  is 


defined  below: 

m-Sj2'*So2 
Letf(m)  =- 


m 


-Sj> 


1-Sa, 


m*Co, 


With  Sj„  Sj2',  Q,  Ca,  e  R**  and  Sa„  Saj  e  (0,1) 
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f(m)  is  simply  the  cost  formula  based  criterion  where  both  sides  have  been  divided  by  the 
constant  Co,  and  the  insertion  cost  Ic  set  aside  (temporarily).  The  remaining  part  of  this 
chapter  will  cover  a  short  mathematical  analysis  of  the  function  f(m),  followed  by  a  set  of 
comments  on  the  corresponding  results. 

4.3.1  Influence  of  Token  Distributions  on  Decision  Surfaces  for  Pushing  a  Sigma: 

As  mentioned  above,  a  brief  study  of  the  function  f(m)  will  show  the  effects  of  m, 
(i.e.  the  distribution  of  token  frequencies)  over  the  decision  surface  for  pushing  selection 
predicates  in  the  discrimination  network,  as  well  as  its  connection  with  the  rank  based 
ordering  approach. 
Domain  of  definition  of  f(m): 

The  domain  of  definition  of  f(m)  Df  is  R"^*,  i.e.  m  e  (0,  +0°) 
Limits  of  f(m): 

The  limits  of  f(m)  are  as  follows: 

Limf(m)  )-°o 

Lim  f(m)  ) 

1-Sa, 

Continuity  of  f(m): 

f(m)  is  continuous  over  its  domain  of  definition  Df. 
Variations  of  f(m): 

The  variations  of  f(m)  are  given  by  the  sign  of  its  first  derivative  over  the  domain  of 
definition  Df.  Since  the  derivative  of  Sj,/(1  -  Sa,)  with  respect  to  m  is  null,  the  first 
derivative  of  f(m)  with  respect  to  m  is  given  by  the  following  term: 
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m-V*So:  Q 
af(m)  _    m*(l-So,)  m*Coi 
3m  dm 

The  quantity  above  being  the  difference  of  two  ratios,  it  can  be  developed  into: 

-i-— ^  *  [m  *  ( 1  -  So,)]  -   *  (m  -  Sj:  '*  SaO    ai^TT ] 

dt(m)  3ni  3m   m*Coi 

^  y    —      

3m  [m*(l-So,)f  3m 

The  values  of  the  derivatives  of  each  term  with  respect  to  m  allow  us  to  rewrite  it  as: 
3f(m)     m*(l-Sa,)-(l-Sa,)*(m-Sj2'*So2)  Q 

^  — I  =  1" 


3m  [m*(l-So,)f  Ca,*m^ 

Factoring  out  (1  -  So,)  in  the  numerator  of  the  first  term 

3f(m)    (l-Sa,)*[m-(m-Sj2'*Sa2)]  Q 
 =  72  + 


3m  [m*(l-So,)]  Co,*m^ 

And  simplifying,  we  finally  get 

3f(m)    (l-Sa,)*Sj2'*So2  Q 
^  —  =  — I- 


3m        [m*(l-Sa,)f  Coi*m^ 

— —    is  defined  over  the  domain  of  definition  Df.  It  is  the  sum  of  two  positive 
3m 

quantities.  Indeed,  the  numerator  of  the  first  quantity  is  the  product  of  three  strictly 
positive  terms,  and  the  denominator  is  a  square.  Similarly,  Q  is  also  a  positive  quantity 
and  so  is  the  denominator  Co,  *  m^.  Consequently,  the  derivative  is  strictly  positive  over 
Df,  and  it  follows  that  f(m)  is  a  monotonically  strictly  growing  function  of  m.  The  table 
below  recapitulates  the  results  of  the  analysis  of  f(m)  obtained  so  far. 
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Table  2  Analysis  of f(m) 


m 

0 

+00 

af(m) 
3m 

+ 

 _^  1-Sj, 

f(m) 

-OO  — ■  ■ 

 1-Sa, 

For  values  of  m  ranging  from  0  to  infinity,  the  first  derivative  remains  strictly 

1-Sj, 


positive  and  the  values  of  f(m)  will  range  (grow)  from  negative  infinity  to  the  ratio 


1-So, 


(which  is  the  rank  criterion).  As  described  in  Table  1,  the  function  f(m)  represents  a 
decision  surface  (or  the  variable  constituent  of  a  decision  surface,  if  Ic  is  taken  into 
picture)  for  the  selection  predicate  push  down  criterion.  A  good  way  to  visualize  the 
effect  of  parameter  m  over  that  decision  surface,  is  to  geometrically  represent  f(m),  and 
observe  how  various  values  of  m  can  affect  it.  The  geometric  visualization  of  f(m)  will  be 
the  focus  of  the  next  section  of  this  chapter. 


4.3.2  Geometric  Interpretation  of  the  Analysis  of  f(m}: 

Let  us  consider  f(m)  as  a  family  of  two-dimensional  surfaces  (denoted  Zm) 
defined  in  a  3D  space.  Each  member  of  Zm  is  a  2D  surface  and  will  be  denoted  Sn,(Sj,, 
S(ji)  and  defined  as: 


m-Sj2'*Sa2 


Sm(Sji,  Sa,)2  = 


Sj. 


m 


So2*ACj2 


l-Sc, 


m*(l-Sm)*Ca, 


2  In  order  to  simplify  the  notations,  Sm(Sj„  Sa,)  and      will  refer  to  the  same  element. 
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Xm  will  therefore  consist  of  a  continuous  set  of  two-dimensional  surfaces  ranging  from 
Sm^o  (denoted  So  by  convention  even  though  So  is  not  defined),  to  Sm^  (denoted  S„  by 
convention).  Disregarding  the  actual  shape  of  the  surfaces  for  the  purpose  of  this  analysis, 
an  Sa  (arbitrary  value  of  m  =  a)  can  be  conceptually  represented  as  in  Figure  4-2  below: 


For  each  value  of  Sj,  and  Sa,  (representing  the  X  and  Y  coordinates  axis,  and 
ranging  in  their  respective  intervals  of  definition),  the  Z  axis  value  is  taken  as  Sa(Sj,,  So,) 
and  represents  the  point  on  the  surface  Sa- 

From  the  analysis  of  f(m)  in  section  3.3.1,  the  Z  axis  position  of  any  Sj  is  strictly 
above  that  of  any  Sk  when  j  >  k,  for  given  values  of  Sj,  and  So,.  (The  reason  being  the  fact 
that  f(m)  is  monotonically  growing  with  respect  to  m,  see  Table  2).  The  highest  of  all 
surfaces  is  therefore  S„,  defined  by  the  limit  of  Sn,  when  m  becomes  very  large,  and  the 
lowest  So,  or  rather  the  limit  of  S^  when  m  becomes  very  small,  since  the  function  is  not 
defined  when  m  is  equal  to  0.  A  visual  representation  is  given  in  Figure  4-3  below: 


Z  =  Sm(Sji,  Soi)  A 


Figure  4-2  Representation  of  a  surface  S, 


'm 
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 ► 


Figure  4-3  Relative  positions  of  Sm  for  different  values  of  m 

In  Figure  4-3  above,     is  drawn  in  black  and  So  is  transparent  (dotted  line).  The 

gray  surfaces  in  between  represent  two  values  j  and  k  (j  >  k)  of  m,  between  0  and  infinity. 

Recalling  the  results  of  Table  1  (still  leaving  aside  the  component  Ic),  the  condition  for 

pushing  down  a  selection  predicate  using  the  cost  formula  based  approach  can  be 

expressed  as: 

m-Sj2'*So2  g 

—  <  ^  9—   which  is  -^<Sn,(Sj,,Sa,) 

Cai  1-Sai  m  *  Coi  Coi 

In  other  words,  the  ratio  defined  by  Cj,/Ca,  (a  strictly  positive  number)  has  to  be 
located  under  the  surface  Sn,  for  the  pushdown  operation  to  improve  (reduce)  the  cost  of 
the  Gator  network. 

The  condition  for  pushing  down  the  selection  predicate  using  the  rank  based 
approach  can  be  expressed  as: 
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— ^<  —    which  is  — ^  <  Soo(Sji,  Soi)  =  Srank 

Coi       1  -  Sal  Col 

In  other  words:  the  ratio  defined  by  Cj/Ca,  has  to  be  located  under  the  surface  S„ 

which  happens  to  be  the  decision  surface  for  the  rank  Srank- 

Consider  now  Figure  4-3  again:  the  value  of  the  ratio  Cji/Co,  is  such  that  it  is 
located  under  but  above  Sj  and  S^;  which  represents  two  cases  (two  values  of  m) 
where  the  rank  based  approach  would  erroneously  perform  a  pushdown  operation.  As  a 
matter  of  fact,  both  those  values  of  m  (m  =  j  and  m  =  k)  represent  cases  where  the 
pushdown  operation  would  induce  an  increase  in  the  cost  of  the  network.  However,  there 
exists  values  of  m,  (let  one  of  those  values  be  i,  i  >  j  >  k)  such  that  the  cost  formula  based 
surface  gets  close  enough  to  S_  (which  is  also  Srank)  for  the  pushdown  operation  to  reduce 
the  cost  of  the  network.  In  that  case,  (illustrated  in  Figure  4-4  below),  the  rank  based 
approach  gives  a  correct  reordering  of  the  predicates  in  the  Gator  network. 


Figure  4-4  Addition  of  surface  Si 
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4.3.3  Qualitative  Interpretation  of  the  Analysis  of  f(m): 

The  analysis  of  the  function  f(m)  showed  that  a  cost  formula  based  decision 
becomes  equivalent  to  a  rank  based  decision  when  the  distribution  of  tokens  is  very 
largely  biased  towards  the  edge  from  which  the  selection  predicate  is  being  pushed.  This 
mathematical  result  can  be  validated  with  the  following  qualitative  reasoning: 

The  rank-based  approach  relies  exclusively  on  the  rank  of  the  selection  predicate 
and  the  rank  of  the  join  with  respect  to  the  current  base  node  to  make  a  pushdown 
decision.  In  our  example  from  Figure  4-1,  the  participation  of  the  right  side  of  the 
network  (tokens  coming  through  aa)  is  ignored  in  the  process  of  considering  the 
pushdown  of  (J\  with  the  rank-based  approach.  On  the  other  hand,  the  cost  formula  based 
approach  does  take  into  consideration  the  participation  of  both  sides  of  the  network  into 
the  pushdown  decision.  Since  very  large  values  of  the  parameter  m  represent 
configurations  where  the  totality  of  the  tokens  pass  through  Fi,  they  also  represent 
configurations  where  the  right  side  of  the  network  does  not  participate  in  the  decision  for 
pushing  down  the  selection  predicate  ai.  On  the  other  hand,  smaller  values  of  m  describe 
configurations  where  the  right  side  of  the  network  has  a  more  substantial  impact  on  the 
decision  for  performing  the  pushdown.  Consequently,  we  can  validate  our  mathematical 
results  showing  a  strong  similarity  between  rank-based  and  cost  formula-based  decisions 
for  large  values  of  m,  since  they  correspond  to  configurations  where  the  participation  of 
F2  is  negligible  or  null,  which  is  the  underlying  assumption  of  using  a  rank-based 
strategy. 
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4.4  Conclusion: 

The  rank  based  decision  surface  represents  in  this  case  an  upper-Hmit  to  the  cost 
formula  based  decision  surface,  that  is  reached  when  parameter  m  is  very  large.  (Large 
values  of  m  represent  cases  where  the  token  distribution  is  extremely  biased  towards  the 
input  edge  where  the  selection  predicate  is  located).  In  such  configurations,  the  rank  and 
cost  formula  based  approaches  are  likely  to  reach  similar  decisions.  However  when  m 
drops  towards  values  balancing  the  token  frequencies  among  the  input  edges,  or  even 
lower  values  reversing  the  bias  in  token  distributions,  the  cost  formula  based  decision 
surface  moves  away  from  Srank  and  erroneous  results  can  be  obtained  with  the  rank  based 
reordering  of  predicates. 

Reconsidering  the  results  of  Table  1,  the  exact  condition  for  pushing  down  a 
selection  predicate,  based  on  the  cost  formula  approach  was: 

m-Sj2'*So2 

 i>j>  Q 

Cj,<  ^  *Ca,-^-Ic 

l-Soi  m 

which  can  be  rewritten  as  follows: 

m-Sj2'*So2 

Ci  O  Ic 

—  <  —  *  Cai  A    where  A  is  a  constant  =  — 

Coi  1-Soi  m*Coi  Coi 

The  only  effect  of  constant  A  on  the  analysis,  performed  in  sections  4.3.1  and 
4.3.2,  will  be  to  change  the  position  of  S„  to  be  located  at  a  Z  coordinate  distance  of  A 
under  Smnk  rather  than  being  equal  to  it.  S„  therefore  becomes  distinct  from  Srank,  but  none 
of  the  influence  of  m  over  the  relative  positioning  of  Sm  and  Srank  is  affected,  and  our 
conclusions  are  still  valid. 
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Taking  into  consideration  delete  tokens  introduces  additional  algebraic 
complexity,  in  the  cost  formulas  and  in  their  notations  (mainly  for  AC),  but  no 
conceptual  change  with  respect  to  insert  tokens  considered  in  this  analysis.  The  extension 
to  cases  where  a  beta  node  has  more  than  two  children  is  briefly  covered  in  Appendix  B 
and  does  not  introduce  any  conceptual  change  either. 

In  conclusion,  the  influence  of  the  tokens  distribution  remains  an  important 
parameter,  causing  the  position  of  the  decision  surface  linked  to  the  cost  formula  to  vary 
widely  and  making  a  rank-based  decision  possibly  incorrect. 


CHAPTER  5 
STRATEGffiS:  DESIGN 

The  example  and  analysis  presented  in  Chapter  4  showed  why  the  rank  metric  was 
not  adopted  to  perform  the  placement  of  selection  predicates  in  discrimination  networks. 
The  position  of  the  decision  surface  for  pushing  down  (or  pulling  up)  a  selection 
predicate  varies  according  to  multiple  parameters  including: 

-  The  distribution  of  token  frequencies  among  input  edges, 

-  The  selectivity  factor  of  the  selection  predicate, 

The  selectivity  factors  of  the  network  edges  the  sigma  is  moved  along, 
The  cost  of  the  selection  predicate. 

The  differential  costs  of  the  joins  along  the  path  of  the  pushdown. 
And  by  ignoring  the  distribution  of  token  frequencies  among  input  edges  and  various 
node  related  costs,  the  rank  metric  is  likely  to  fail  to  produce  a  correct  reordering  of  the 
predicates  in  many  cases.  In  the  process  of  investigating  the  possible  strategies  that  can  be 
applied  towards  obtaining  a  correct  solution  this  problem,  the  most  straightforward  idea  is 
to  try  to  adapt  the  rank  metric  to  discrimination  network  by  trying  to  incorporate  the 
missing  parameters.  In  the  following  section,  the  idea  of  adapting  the  rank  metric  to  Gator 
networks  is  investigated  in  detail. 
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5. 1    Adapting  the  Rank  Metric  to  Gator  Networks: 
The  selection  predicate  pushdown  criterion  based  on  the  cost  formula  from  Table 


1  was: 

m-Sj:'*So2 


-Sj, 


1-Soi  m*(l-Sai) 
If  the  insertion  cost  Ic  is  considered  negligible  (Ic  =  0),  and  the  assumption  about  the 

differential  cost  Ch  being  constant  (see  section  4.1.2)  admitted  (AQz  =  0),  the  rank  of  the 

join  (with  respect  to  aO  can  be  redefined  to  the  following  expression: 

m-Sj2'*So2 


Sj 


rank(J/ai)  =   (3) 

Cji 

As  a  matter  of  fact,  rewriting  rank(J/ai)  <  rank(ai)  (the  rank  based  criterion  for  the 
selection  predicate  pushdown)  gives  now  the  same  result  as  the  cost  formula  based 
approach.  When  the  insertion  cost  Ic  is  taken  into  perspective,  the  definition  of  the  rank 

of  the  join  has  to  be  changed  to: 

,m-S/*S..    Ic*(l-So.),    _     ^m-Sj2'*S.2  ,  ^  ^ 
Sji-[  J    Sji-[  I-Ic*rank(ai)] 

rank(J/ai)  =  3^—  ^  =  ^  

Cji  Cji 

The  elimination  of  the  second  assumption  about  the  differential  cost  Ck  introduces 

So2  *  AC  2 

an  extra  term  of  the  form  —   in  the  numerator  of  the  expression  above.  Without 

m^Loi 

complicating  the  expression  of  the  rank  of  the  join  any  further,  it  can  be  noticed  from  the 
expression  above,  that  the  rank  of  the  join  is  not  an  independent  quantity  anymore,  but  is 
related  to  the  rank  of  the  selection  predicate  being  pushed  down.  Indeed  (and  unless  the 
insertion  cost  Ic  is  neglected),  parameters  of  the  rank  of  the  selection  predicate  now 
appear  in  the  expression  of  the  rank  of  the  join.  This  new  characteristic  of  the  rank  of  the 
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join  is  certainly  not  a  desirable  one,  since  the  positioning  of  a  selection  predicate  with 
respect  to  a  join  predicate  may  cause  the  rank  value  of  the  latter  to  change.  One  of  the 
consequences  of  this  is  an  impact  on  the  convergence  of  a  rank  based  algorithm  for  Gator 
networks. 

Clearly,  a  complete  analytical  model  for  a  Gator  networks  adapted  rank  metric, 
would  quickly  reach  the  limits  of  tractability,  especially  when  considering  the  following 
required  cases  to  be  incorporated  into  the  already  cumbersome  expression  shown  above: 

-  Incorporate  the  cases  where  beta  nodes  can  have  more  than  two  children. 

-  Incorporate  the  cost  of  delete  tokens  into  the  global  cost  analysis  of  the  network. 

-  Take  into  account  the  possibility  of  more  than  "one-level"  pushdowns. 

Working  with  a  (relatively)  simplified  representation  of  the  rank  such  as  the  one 
presented  in  equation  (3)  above  is  not  an  alternative  in  this  case  for  the  following  reason: 
The  resulting  loss  of  accuracy  would  make  this  model  inadequate  for  integration  with  the 
existing  Gator  networks  optimization  algorithm(s).  Indeed,  the  overall  accuracy  (and 
reliability)  of  the  optimizer  would  become  limited  by  that  of  its  "worst"  component,  the 
selection  predicate  placement  module  in  this  matter.  Therefore,  adding  the  selection 
predicate  placement  heuristic  would  reduce  the  accuracy  of  the  existing  optimizer,  which 
is  not  a  desirable  effect. 

While  the  ordering  of  selections  and  joins  based  on  the  rank  metric  does  yield 
satisfying  results  in  query  processing,  it  is  our  opinion  that  trying  to  adapt  it  to  the 
problem  of  ordering  selection  predicates  among  joins  in  a  Gator  network  will  suffer  from 
the  following  major  limitations: 
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-  First,  the  rank  metric  does  not  naturally  extend  the  existing  cost  model  and 
randomized  algorithm(s)  described  earlier  in  Chapter  3,  based  on  a  set  of  local 
change  operators.  Thus,  the  introduction  of  the  notion  of  a  rank  ordering  of 
predicates  in  the  Gator  network  optimizer  would  prove  relatively  complex  to 
combine  with  the  existing  cost  formulas  in  order  to  achieve  a  homogeneous  and 
integrated  global  discrimination  network  optimizer. 

-  While  the  idea  of  developing  a  rank  metric  for  Gator  networks  is  conceptually 
feasible,  the  practical  formulation  of  such  a  metric  is  very  likely  to  become  extremely 
cumbersome  and  unpractical  of  use  within  the  optimizer.  The  impracticality  of  a 
Gator  network  rank  will  be  furthermore  increased  by  the  contextual  nature  it  exhibits, 
as  shown  earlier  in  this  section. 

Since  the  introduction  of  a  rank  based  heuristic  would  leave  us  with  either  a 
crippled  (using  a  simplified  version  of  the  rank)  or  an  overly  complex  and  ill-integrated 
optimizer,  it  has  been  decided  to  avoid  investigating  further  any  approach  based  on  an 
adaptation  of  the  rank  metric  to  Gator  networks.  Instead,  our  efforts  will  be  directed 
towards  the  design  of  alternative  strategies,  better  adapted  to  this  problem,  and  satisfying 
the  following  requirements: 

-  In  order  to  maintain  a  maximum  continuity  with  the  current  cost  model  and 
optimizer,  any  proposed  approach  should  reuse  the  existing  work,  including  cost 
formulas,  heuristics  and  local  change  operators.  This  will  also  have  the  advantage  of 
keeping  the  complexity  of  the  optimizer  as  low  as  possible  from  both  a  concepmal 
and  coding  /  maintenance  aspects. 
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-  A  rank  based  ordering  will  still  be  applied  whenever  possible,  such  as  during  a  final 
pass  on  groups  of  selection  predicates  lying  on  the  same  edge  of  a  network. 

The  remaining  sections  of  this  chapter  will  therefore  cover  a  set  of  proposed 
strategies  for  incorporating  selection  predicate  placement  heuristics  within  the  Gator 
network  optimizer.  Those  strategies  can  be  classified  into  two  major  groups: 

-  Low  complexity  strategies  with  little  to  no  impact  on  the  current  optimizer  search 
space  size  and  local  change  operators.  This  type  of  approach  will  use  two  distinct 
phases  and  will  be  therefore  referred  to  as  "Two  Phase  Strategies." 

Higher  complexity  methods  requiring  extensions  to  the  currently  proposed  model  and 
resulting  in  larger  search  space  sizes.  For  such  methods,  we  will  propose  a  minimal, 
but  necessary,  set  of  extensions  and  semantic  changes  to  the  present  local  change 
operators  "Kill  Beta"  and  "Merge  Sibling."  This  class  of  methods  runs  the 
optimization  in  one  single  phase  and  will  therefore  be  referred  to  as  "One  Phase 
Strategies." 

5.2   Two  Phase  Strategies: 

As  indicated  by  its  name,  this  approach  will  operate  in  two  distinct  phases  and  can 
be  labeled  as  "simple"  for  two  primary  reasons: 

-  It  does  not  increase  the  size  of  the  search  space  in  which  the  current  Gator  network 
optimization  algorithm  operates.  In  fact,  this  strategy  comes  as  an  additional  pass 
invoked  by  the  optimizer  after  the  best  Gator  network  has  been  found,  and  adds  a 
negligible  amount  of  work  to  the  existing  algorithm. 
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-    It  does  not  require  any  modification  to  the  existing  local  change  operators  that  were 

described  earlier  in  chapter  3. 
The  incorporation  of  this  method  into  the  existing  Gator  networks  optimizer  will  require  a 
minimum  set  of  perturbations  to  the  logic  of  the  existing  implementation.  An  algorithmic, 
high  level  C++  description  of  the  addition  of  this  method  to  the  existing  Gator  optimizer 
is  as  follows: 

5.2. 1  hitegrating  the  Simple  Strategy  With  the  Gator  Optimizer: 

Gator*  GetOptimalGator  (QueryGraph*  Graph) 
Begin 

GatorNet*  Best-Gator; 
Best-Gator  =  RunGatorOptimizer  (Graph) 
Best-Gator  =  RunSigmaPlacement  (Best-Gator) 
Return  (Best-Gator) 

End 

The  function  RunGatorOptimizer()  represents  an  abstraction  of  the  existing 
Gator  optimizer,  which  takes  a  rule  graph  as  a  parameter  and  returns  the  best  possible 
discrimination  network  with  the  selection  predicates  placed  above  the  alpha  nodes. 
RunSigmaPlacementO  takes  the  best  structure  found  so  far  and  tries  to  improve  it 
further  by  "sliding"  the  selection  predicate  nodes  (referred  to  as  sigmas)  down  the 
network.  The  proposed  heuristic  for  pushing  down  a  selection  predicate  is  driven  by  the 
token  frequencies  and  can  be  simply  described  in  English  as  follows: 

"Let  C  be  the  initial  cost  of  the  best  Gator  network  before  trying  to  push  down  the 
Sigma  node.  When  a  selection  predicate  a  is  pushed  down,  all  the  insertion,  deletion  and 
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join  operations  above  it  are  executed  more  often,  and  cause  a  cost  increase  AC"^.  In  order 
to  achieve  a  global  reduction  of  C,  the  a  operation  has  to  be  done  less  frequently,  and  the 
incurred  decrease  AC"  has  to  be  large  enough  to  counter  balance  the  increase  AC^.  A 
necessary  (but  not  sufficient)  condition  for  that  configuration  to  happen,  is  to  find  a 
location  in  the  network  where  the  frequency  of  tokens  Fi'  is  strictly  smaller  than  the 
original  Fj  when  the  a  was  at  the  top  of  the  network." 
Consider  the  case  of  Figure  5-1  below: 


Figure  5-1  Token  frequency  driven  heuristic 


Theorem: 


A  necessary  condition  for  the  positioning  of  Gi  above  the  P  node  to  be  beneficial  is: 


Fi2  <  Fi 


Proof; 

Consider  the  network  on  the  left  side  of  Figure  5-1  and  let: 

-  So,  and  Co,  represent  respectively  the  selectivity  and  cost  of  the  selection  predicate  a\. 
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-  Cj,  and  Cj2  represent  the  differential  cost  of  the  join  between  the  two  alpha  nodes  in 
the  sub  network  defined  by  the  dashed-line  box. 

Fi  represent  the  frequency  of  tokens  coming  into  a\. 

-  C  represent  the  total  cost  of  the  network  and  Ci  and  C2  represent  respectively  the 
costs  of  the  sub  networks  in  the  dashed-line  and  full  line. 

By  definition,  we  have: 

-  C  =  Ci+C2 

-  Ci  =  (Fi  *  Co,)  +  (Fi  *  So,  *  Cj,)  +  (F2  *  QO 

Consider  now  the  network  on  the  right  side  of  Figure  5-1  and  let  F12  represent  the 
frequency  of  tokens  flowing  through  the  selection  predicate  C\. 
Lemma: 

The  properties  of  relational  algebra  [Smi75,  Tam91,  U1182]  are  such  that  the  cost  of  the 
network  in  the  box  drawn  in  the  full  line  is  unchanged  and  remains  C2. 

The  cost  of  the  network  in  the  box  in  the  dashed  line  becomes: 
Cr  =  (Fi  *  Cj,)  +  (F2  *  Cj2')  -I-  (F12  *  Co,)  and  the  total  cost  is  C  =  C,'  +  C2 

C  <  C  0  C  +  C2  <  Ci  +  C2  o  Ci'  <  Ci 

In  order  to  achieve  C  <  C,  it  is  necessary  that  Ci'  <  Ci 

Comparing  the  terms  of  C|  and  Ci'  we  have: 

Fi  *  Cj,  >  F,  *  Sa,  *  Cj,  (Since  0  <  Sa,  <  1  by  definition) 

Similarly,  since  ai  becomes  larger  after  the  selection  pushdown,  the  differential  cost  Cj:' 
is  greater  or  equal  to  Q:,  and  we  have  F2  *  Cj2'  >=  F2  *  Q: .  Therefore,  if  F12  >  Fi,  it 
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follows  that  Fi2  *  Ca,  >  Fi  *  Ca,  and  Cf  >  Ci  since  all  three  components  of  Cr  are  larger 
than  those  of  Ci.  Hence,  the  necessary  condition  for  Ci'  to  be  smaller  than  Ci  is  that  F12 
be  (strictly)  smaller  than  Fi:  F12  <  Fi  (QED). 

Below  is  a  high-level  algorithmic  description  of  this  heuristic. 

Algorithm: 

(It  is  assumed  that  the  best  network  without  the  selection  predicate  placement 
optimization  is  returned  in  Best-Gator,  before  calling  this  function). 
RunSiginaPlaceinent(Gator*  Best-Gator) 
Begin 

C  =  NetCost  (Best-Gator) 
For  (each  <5\ ) 
F  =  Fi 

ai  =  parent  of  G\ 

Remove  ai  from  network  and  recompute  frequencies  on  path  tti  — >  Pnode. 
For  (each  node  n^  of  type  pk  or  ak  on  path  a;  Pnode) 
If  (Fk<Fi)  then 

New-Gator  =  PlaceSigma  (ai ,  nk) 
C  =  NetCost  (New-Gator) 
If(C'  <C)  then 

Delete  (Best-Gator) 
Best-Gator  =  New-Gator 
Else 

I    Delete  (New-Gator) 
Endif 
Endif 
EndFor 

EndFor 

For  (each  leaf     Pnode  path) 

If  (more  than  one  a  on  same  edge) 

I       Reorder  by  rank 

Endif 

EndFor 
End 
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5.2.2  Complexity  Analysis: 

A  Gator  network  with  N  tuple  variables  has  a  maximum  height  of  N-1  when  it  is 
a  left  (or  right)  deep  binary  tree  as  shown  in  Figure  5-2  below: 


Pnode 

Figure  5-2  Maximum  height  of  a  Gator  network 
For  each  selection  predicate,  the  function  RunSigmaPlacement  performs  a 
constant  amount  of  work  W  by  scanning  the  complete  path  from  the  sigma  node  all  the 
way  to  the  Pnode,  searching  for  a  suitable  position  to  relocate  it.  The  length  of  that  path  is 
at  most  (N-1)  and  the  number  of  selection  predicates  to  process  is  considered  not  to 
exceed  0(N)  (each  tuple  variable  can  have  one  or  a  few  selection  predicates  associated 
with  it,  in  the  case  of  conjunction  of  predicates,  such  as:  di  and  Cz).  The  total  amount  of 
work  spent  on  all  sigma  nodes  is  therefore  of  the  order  of  0(N)  *  (N  -  1 )  *  W  which  is 
0(WN  ).  If  any  type  of  sorting  of  the  selection  predicates  is  performed  as  a  preprocessing 
or  post-processing  step  (for  example,  a  final  sorting  by  rank  of  all  sigmas  lying  on  the 
same  edges),  the  added  amount  of  work  is  not  worse  than  O(N^)  [Cor90],  and  does  not 
increase  the  overall  complexity  of  the  function  which  remains  polynomial.  Since  the 
Gator  network  optimizer  calls  one  of  three  possible  operators  at  each  iteration,  the 
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dimension  of  the  search  tree  is  therefore  0(3"^)  for  N  iterations,  and  the  global  state  space 
is  exponential  by  nature.  (Some  states  could  be  visited  more  than  once  during  an 
optimization  phase,  but  the  upper  bound  still  remains  0(3^)).  Consequently,  the  addition 
of  RunSigmaPlacement  does  not  increase  the  size  of  the  state  space  from  a 
combinatorial  complexity  aspect. 

5.2.3  Influence  of  Indexes  on  the  Heuristic: 

It  was  mentioned  at  the  beginning  of  section  4.3  that  the  use  of  indexes  (and  the 
corresponding  cost  formulas)  was  not  taken  into  account  as  far  as  the  problem  of 
selection  predicate  placement  was  concerned.  The  reason  behind  this  simplification  is  the 
fact  that  the  pushdown  heuristic  is  independent  of  the  presence  or  absence  of  indexes  in 
the  Gator  network.  Indeed  (and  as  described  above),  the  only  parameter  considered  by  the 
heuristic  to  perform  a  pushdown  is  the  frequency  of  tokens  flowing  across  various  edges 
of  the  network  lying  between  the  location  of  the  selection  predicate  and  the  Pnode.  Since 
the  definition  of  an  index  over  a  (join)  attribute  has  no  effect  on  the  "number"  of  tokens 
generated  by  that  join  condition,  it  does  not  have  any  influence  on  the  resulting  frequency 
of  tokens  and  has  therefore  no  impact  on  the  logic  of  the  heuristic.  As  a  matter  of  fact,  the 
amount  of  tokens  generated  by  a  join  is  dependent  solely  on  the  selectivity  of  the  join 
condition,  which  is  a  property  of  the  join  itself.  While  the  presence  of  indexes  has  no 
consequence  on  the  heuristic  itself,  it  is  important  to  notice  that  indexes  will  modify  the 
costs  of  joins.  Consequently,  the  presence  of  indexes  will  have  an  impact  on  the  global 
decision  regarding  whether  the  cost  of  the  network  is  reduced  by  a  pushdown  or  not. 
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5.3   One  Phase  Strategies: 

Contrarily  to  the  strategy  described  in  the  previous  section,  the  existing  Gator 
optimizer  will  not  be  directly  reusable  in  a  straightforward  manner  it  was  in  the  previous 
case.  Instead  of  adding  the  selection  predicate  placement  as  an  extra  phase,  the  idea  here 
is  to  integrate  it  within  the  process  of  building  the  discrimination  network.  In  order  to  do 
so,  one  (or  more)  local  change  operators  are  added  to  the  existing  set  of  three:  "Create 
Beta,"  "Kill  Beta"  and  "Merge  Sibling."  These  new  operators  (Push  Sigma  and  Pull 
Sigma)  will  be  called  randomly  during  the  local  perturbations  phase,  and  compete  with 
the  others.  The  set  of  strategies  based  on  this  approach  is  of  higher  complexity  than  the 
previous  one  for  the  following  reasons: 

The  incorporation  of  new  operators  into  the  existing  algorithm  is  a  "double  edged 
sword."  Indeed,  the  immediate  consequence  is  a  substantial  increase  in  the  search 
space  size.  While  a  larger  search  space  allows  potentially  better  solutions  to  be 
found,  it  also  has  the  major  disadvantage  of  possibly  causing  the  optimizer  to  follow 
"dead-ends"  search  paths  while  searching  for  local  minimums.  An  important  tradeoff 
issue  of  optimization  complexity  versus  result  quality  is  to  be  considered  here. 
-  The  addition  of  new  operators  allowing  the  placement  of  a  a  node  while  the  network 
is  being  built,  brings  up  issues  of  semantic  nature  about  the  meanings  of  "Kill  Beta" 
and  "Merge  Sibling"  and  their  impact  on  the  integrity  of  the  Gator  network. 
Therefore,  the  behavior  of  some  of  the  existing  operators  has  to  be  revised  before  this 
type  of  approach  can  be  implemented. 
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5.3.1  Complexity  Analysis:  < 

In  this  type  of  strategy,  the  number  of  local  change  operators  is  increased  from 
three  to  five.  The  number  of  potential  branches  at  iterative  step  of  the  optimizer  is 
therefore  also  increased  from  three  to  five.  As  a  result,  the  total  potential  number  of  states 
that  can  be  visited  is  raised  from  0(3"^)  to  0(5'^)  for  N  iterations  (once  again,  it  is 
possible  that  multiple  states  can  be  visited  several  times  during  an  optimization  phase,  but 
the  upper  bound  still  remains).  This  type  of  approach  has  therefore  a  more  serious  impact 
on  the  complexity  than  the  previous  strategy:  even  though  the  order  of  magnitude  of  the 
search  space  is  not  increased,  the  potentially  added  size  is  exponential  versus  polynomial 
in  the  previous  case. 

5.3.2  Semantic  Issues  for  Kill  Beta: 


Figure  5-3  Semantic  issues  for  Kill  Beta 


Consider  the  case  of  Figure  5-3  above.  The  left  side  of  the  figure  describes  a 
configuration  where  a  selection  predicate  a  has  been  pushed  down  in  the  network,  just 
above  node  pi,  as  the  result  of  a  previous  local  change  operator  sequence.  Suppose  that 
the  next  iteration  of  the  optimizer  decides  to  perform  a  "Kill  Beta"  operation  on  node  Pi. 
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If  the  "Kill  Beta"  operator  is  kept  unchanged  (i.e.  behaves  the  same  way  it  does  in  the 
current  optimizer  and  as  described  in  chapter  3),  once  the  operation  is  performed  (node  (3] 
killed),  the  node  a  becomes  sibling  of  ^2,  as  shown  on  the  right  side  of  Figure  5-3. 
Consequently,  every  token  inserted  into  |32  has  to  join  with  a,  which  is  not  supported  by 
the  semantics  of  Gator  networks  algorithms,  since  a  nodes  are  predicates  only,  and  not 
memory  nodes.  As  a  result,  the  Gator  network  gets  into  the  inconsistent  state  shown  on 
the  right  side  of  the  figure  above,  and  a  fix  becomes  necessary  for  this  type  of  situation. 

5.3.3  Semantic  Issues  for  Merge  Sibling: 


Pnode 

Figure  5-4  Semantic  issues  for  Merge  Sibling 


In  the  case  of  a  Figure  5-4  above,  a  Merge  Sibling  operation  is  considered  after 
the  selection  predicate  has  been  pushed  above  Pi.  When  node  p2  is  merged  with  Pi,  it 
becomes  sibling  of  a  as  shown  on  the  right  side  of  Figure  5-4.  This  situation  again  causes 
inconsistency  in  the  structure  of  the  Gator  network  similar  to  the  one  presented  in  5.3.2. 
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5.3.4  Suggested  Fixes  for  the  Semantic  Issues: 

There  are  multiple  solutions  that  can  be  applied  towards  correcting  the 
inconsistent  state  in  which  a  Gator  network  might  fall  during  the  operations  described  in 
sections  5.3.1  and  5.3.2. 
Conservative  approach:  No  Merge-No  Kill: 

The  conservative  approach  is  to  simply  "forbid"  the  killing  or  the  merging  of  a 
beta  node,  when  a  sigma  node  is  located  above  the  node  to  kill  or  the  merge  node.  While 
this  approach  is  the  simplest  from  an  implementation  point  of  view,  it  may  be  limited  by 
the  disadvantage  of  possibly  reducing  the  number  of  operations  the  optimizer  can 
perform,  which  can  in  turn  have  some  impact  on  the  results  that  the  heuristic  could 
produce. 

Progressive  approaches: 

A  more  progressive  approach  consists  of  allowing  the  operation  to  take  place,  but 
taking  extra  actions  to  fix  the  inconsistency  of  the  network  by  relocating  the  sigma  node 
to  a  "suitable"  location  in  the  network.  This  would  become  equivalent  to  combining  a 
"Push  Sigma"  or  "Pull  Sigma"  with  the  "Kill  Beta"  or  "Merge  Sibling,"  every  time  this 
type  of  inconsistent  situation  is  encountered.  An  example  solution  would  be  to  bring  the 
selection  predicate  back  to  the  top  of  the  network,  which  will  be  referred  as  the  "Pop  To 
Top"  approach.  The  decision  on  where  to  relocate  the  selection  predicate  is  left  as  an 
implementation  issue. 
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5.3.5  Sub-Strategies: 

Depending  on  different  placement  options  for  the  selection  predicates,  it  is 
possible  to  generate  a  number  of  sub  strategies  all  derived  from  the  same  idea  consisting 
of  mixing  the  three  traditional  local  change  operators  with  the  selection  predicate 
placement  ones.  Some  examples  are  listed  below: 
Sigma  placement  during  network  build  up: 


The  placement  of  a  nodes  at  different  levels  of  the  discrimination  network  can  be 
performed  during  the  network  build-up  phase,  before  applying  any  local  transformation. 
A  more  conservative  approach  would  require  that  new  networks  be  always  built  with 
selection  predicates  at  the  top  and  that  only  local  perturbation  operators  perform  the 
placements. 

Sigma  Push  and  Pull: 

The  local  perturbation  operators  "Push  Sigma"  and  "Pull  Sigma"  can  themselves 
operate  in  a  number  of  different  combinations: 

-  Random  Push  and  Pull:  the  placement  of  the  selection  predicates  is  driven  by  random 
decisions  only. 

-  Searched  Push,  Random  Pull:  The  pushdown  operation  is  guided  by  the  heuristic 
described  for  the  Two  Phase  strategy  above,  while  the  pulling  up  is  done  randomly. 
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-  Extensive  search  for  pull-ups  and  searched  push  downs  for  networks  with  small 
numbers  of  tuple  variables:  extensively  search  all  the  possible  placements  for  the 
selection  predicate  on  the  path  from  the  leaf  to  the  Pnode  when  performing  a  pull-up 
and  apply  the  heuristic  for  a  pushdown. 

Livestigating  the  possible  variants  analytically  would  be  exciting  but  is  a  very 
difficult  and  probably  not  beneficial  task,  because  of  the  complexity  of  the  corresponding 
mathematical  models.  It  is  our  belief  that  the  investigation  of  the  relative  performances, 
advantages  and  disadvantages  of  each  sub-strategy  will  be  better  handled  at  the 
implementation  level,  with  numerical  results. 

In  the  next  chapter,  we  will  describe  an  implementation  of  the  Gator  network 
optimizer  with  selection  predicate  placement  capabilities,  and  discuss  some 
implementation  issues  and  difficulties,  and  proceed  with  some  simulation  results  for  both 
simple  and  advanced  strategies  (Two  Phase  and  One  Phase). 


CHAPTER  6 
STRATEGIES:  IMPLEMENTATION 

In  the  previous  chapter,  we  presented  a  detailed  description  of  the  proposed 
strategies  to  perform  selection  predicate  placement  in  Gator  networks  from  a  design  point 
of  view.  In  this  chapter,  the  reader  will  find  an  elaboration  on  some  important  issues  and 
questions  that  arise  on  different  aspects  of  the  optimizer,  when  the  proposed  strategies  are 
put  into  practice.  The  purpose  of  this  chapter  was  not  to  write  a  complete  full-fledged 
implementation  of  a  Gator  network  optimizer,  with  selection  predicate  placement 
capabilities.  Rather,  the  goal  was  to  investigate  and  bring  into  light  the  possible 
roadblocks  and  coding  difficulties  that  the  implementers  might  encounter  when  writing 
most  of  the  proposed  strategies.  The  important  points  that  will  be  addressed  in  this 
chapter  are  as  follows: 

Investigate  the  coding  difficulties  related  to  a  new  (more  efficient)  implementation 
of  the  local  change  operators. 

-  Look  over  any  necessary  revisions  to  be  applied  to  cost  formulas  based  on  the  new 
operators  allowing  relocation  of  selection  predicates. 

-  Verify  whether  the  proposed  strategies  work  and  improve  the  Gator  network. 

-  Compare  the  relative  running  times  and  result  quality  of  both  the  simple  and  complex 
strategies,  and  find  out  whether  the  latter  yields  any  tangible  improvement  with 
respect  to  the  simple  one,  and  in  what  cases. 
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6. 1    Implementation  of  Local  Change  Operators: 

The  local  change  operators  "Create  Beta,"  "Kill  Beta"  and  "Merge  Sibling"  were 
implemented  in  a  modified  version  of  Ariel  [Han97b]  by  regenerating  a  completely  new 
network  at  each  iteration.  The  function  ApplyPerturbation  described  in  section  3.1.2: 

New-Gator  =  ApplyPerturbation  (CurrentGator,  Perturbation) 
takes  as  parameters  the  current  Gator  network  and  one  of  the  local  change  operators,  and 
applies  the  change  to  return  a  new  network  that  is  a  duplicate  of  the  current  one  with  the 
local  change  applied  to  it.  If  the  cost  of  the  network  is  improved  by  the  current  operation, 
then  the  old  network  is  discarded,  and  the  new  one  kept  as  the  current  network,  otherwise, 
the  new  network  is  discarded  and  the  old  one  remains  as  the  current  network. 

Having  seen  room  for  improvement  in  both  the  memory  use  and  speed  of  the 
optimizer,  we  have  decided  to  implement  the  local  change  operators  differently  this  time 
around.  Each  of  the  operators  comes  in  a  tripartite  structure  (three  function  or  methods  in 
the  object  oriented  terminology)  which  allows  trying  it  and  then  either  validating  or 
undoing  the  modifications  that  were  made  to  the  network,  based  on  the  old  and  new  costs. 
Consequently,  each  of  the  local  change  operators  is  implemented  with  the  following 
methods: 

-  Gator*  TryOperator  (Gator*,  char*  LocalChange) 

-  Gator*  DoOperator  (Gator*,  char*  LocalChange) 

-  Gator*  UndoOperator  (Gator*,  char*  LocalChange) 

with  Operator  e  {Create  Beta,  Kill  Beta,  Merge  Sibling,  Push  Sigma,  Pull  Sigma}. 
As  mentioned  above,  the  advantages  of  this  approach  are  straightforward: 
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-  Better  memory  usage  (only  one  network  in  memory  all  the  time). 

-  Faster  optimization  process  by  not  having  to  regenerate  and  delete  a  whole  network 
at  each  step. 

The  implementation  of  those  methods  required  maintaining  some  additional  state 
information  within  the  structure  of  the  nodes  of  the  network  (old  neighbors,  old  parents, 
old  children  etc)  in  order  to  allow  the  rollback  when  necessary.  However,  the  duplicated 
information  is  always  localized  to  the  "region"  of  the  network  being  affected  by  the 
current  perturbation  only,  and  therefore  kept  to  a  minimum. 

6.2   Revision  of  the  Cost  Formulas  hnplementation: 

Since  selection  predicates  were  systematically  placed  at  the  top  of  the  Gator 
network  in  the  traditional  optimizer,  they  did  not  have  a  "proper"  existence  but  were 
rather  considered  as  a  preprocessing  filtering  step  to  the  network.  Consequently,  the  cost 
of  the  selection  was  not  computed  as  a  part  of  the  global  cost  of  the  network.  In  this 
implementation,  the  ability  to  relocate  a  selection  predicate  potentially  anywhere  in  the 
network,  has  necessitated  implementing  it  as  an  entity  (node)  with  an  associated  method 
to  compute  its  cost  contribution.  Furthermore,  the  possible  presence  of  a  sigma  just  above 
a  beta  node  has  been  at  the  origin  of  a  few  minor,  but  necessary  modifications  to  the 
method  for  computing  the  cost  of  beta  nodes.  A  more  detailed  discussion  of  both  these 
issues  is  presented  in  the  following  sections. 

6.2.1  Cost  of  a  Sigma  Node: 

Three  distinct  cases  were  identified  for  computing  the  cost  of  a  selection 
predicate: 
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-  The  sigma  is  at  the  top  of  the  network. 

-  The  sigma  is  pushed  down  and  has  "regular"  children  (Alphas  and  Betas). 

-  The  sigma  is  pushed  down  and  has  another  sigma  node  as  a  child. 
A  description  of  each  of  the  three  cases  above  follows. 

Sigma  at  the  top  of  the  network: 


Figure  6-1  Sigma  at  the  top  of  the  network 
In  this  case  depicted  in  Figure  6-1  above,  the  cost  of  the  sigma  node  is  simply  the 
frequency  of  tokens  flowing  through  the  selection  predicate,  multiplied  by  the  cost  of 
computing  the  selection  condition  (noted  SelectionCost). 

SigmaCost  (ai)=  Fi  *  SelectionCost  {d) 


Sigma  with  regular  (Alpha  and  Beta)  children: 


Figure  6-2  Sigma  with  regular  children 
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In  this  case,  the  selection  predicate  cost  will  have  two  components: 

-  The  cost  of  computing  the  selection  condition  multiplied  by  the  frequency  of  tokens 
flowing  through  the  predicate.  This  frequency  of  tokens  is  computed  in  a  similar 
fashion  as  it  is  for  beta  nodes,  by  adding  up  the  anticipated  participation  of  each  of 
the  children  nodes  in  producing  the  amount  of  tokens  passing  through  the  sigma 
node. 

-  The  cost  of  executing  the  joins  across  the  children  is  also  added  to  the  cost  of  the 
selection  predicate  here,  as  it  is  the  case  for  beta  nodes. 

SigmaCost  (ai)=  Fj  *  SelectionCost  (0,)  +  CostJoinChildren(ai) 

Note: 

The  reader  will  notice  that  the  PerChildInsertCost()  and  PerChildDeleteCost() 
components  that  are  also  added  to  a  beta  node  with  children  is  not  applied  to  the 
selection  predicate.  The  reason  for  this  is  that  sigmas  are  not  nodes  in  which  tokens  can 
be  inserted  into  (or  deleted  from)  but  simply  predicates  acting  as  filters  and  allowing 
tokens  through,  or  not.  The  PerChildInsertCost()  and  PerChildDeleteCost()  components 
cannot  however  be  lost,  and  have  to  be  saved  until  the  first  beta  node  above  this  sigma 
is  found.  At  that  point,  the  per-child  insertion  and  deletion  costs  are  applied  to  that  beta 
node.  In  order  to  handle  this  "transfer"  of  cost  to  the  next  beta  node  in  the  network,  it 
was  necessary  to  attach  to  sigma  nodes  a  simple  structure  with  one  entry  for  each  child. 
Each  entry  of  the  list  maintains  the  join  size  (with  respect  to  that  child),  and  insert  and 
delete  frequencies  of  that  child.  We  will  refer  to  these  lists  as  "update  costs  lists"  and  a 
sample  one  is  presented  in  Figure  6-3  below: 
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a, 

 ► 

join  size  *  Sel(ai) 

join  size  *  Sel(Oi) 

join  size  *  Sel(ai) 

iFreq  *  Sel(ai) 

iFreq  *  Sel(ai) 

iFreq  *  Sel(Oi) 

dFreq  *  Sel(ai) 

dFreq  *  Sel(ai) 

dFreq  *  Sel(Oi) 

Figure  6-3  Sigma  list  for  transfers  of  PerChildlnsertCost  and  PerChildDeleteCost 
In  this  case,  <5\  has  been  pushed  just  above  the  beta  node.  The  cost  of  G|  includes 
the  cost  of  executing  the  selection  predicate  computation  (Fai  *  SelectionCost(ai))  plus 
the  costs  of  executing  the  joins  originating  from  each  of  the  children  tti,  and  a^.  The 
cost  relative  to  the  insertion  or  deletion  of  the  results  of  those  joins  is  not  applied  at  the 
level  of  CS\,  but  is  transferred  to  the  P  below  G\  through  the  list  described  above.  The  join 
size,  insert  frequency  and  delete  frequency  are  multiplied  by  the  selectivity  of  the 
selection  predicate  (filtering  effect)  and  appended  to  the  list  described  above,  to  be 
processed  at  the  level  of  the  beta  node  for  computing  the  PerChildInsertCost()  and 
PerChildDeleteCostO  components. 

Sigma  with  a  Sigma  child: 

When  multiple  selection  predicates  are  pushed  down  during  the  optimization 
phase,  it  is  possible  to  reach  configurations  where  two  (or  more)  sigmas  end  up  on  the 
same  edge  of  a  network.  In  such  a  case,  a  sigma  node  has  another  sigma  as  a  child.  (Note 
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that  sigmas  are  always  "unique"  children  of  their  parent  node).  This  case  is  illustrated  in 
Figure  6-4  below: 


F,  F: 


«i'''"^«2'''""a3 


^3 


Figure  6-4  Sigma  with  a  Sigma  child 
The  selection  predicate  C2  has  a  unique  child  Gi  and  will  simply  "relay"  the 
update  costs  list  from  child  d,  after  applying  the  filtering  effect  Sel(C2)  to  each  value  in 
the  list  of  G\ .  The  cost  of  a2  is  here  again  reduced  to  the  selection  cost  only: 

SigmaCost  (02)=  Foj  *  SelectionCost  (CT2) 
Where  Fa2  is  computed  as  Fai  *  Sel(a2). 

6.2.2  Cost  of  a  Beta  Node: 

As  a  reminder  of  the  Gator  network  cost  model  presented  in  Chapter  3,  the  cost  of 
a  beta  node  is  computed  as  follows: 

LocalCost(  p)  =     ^  { Fi(N)  *  InsertCost  (N,  P)  +  Fd(N)  *  DeleteCost  (N,  p)} 

Nechjldren(P) 
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Where  procedures  InsertCost  and  DeleteCost  include  respectively  the  cost  of 
performing  the  join  across  all  children,  plus  the  costs  of  inserting  and  deleting  from  the 
beta  node.  This  is  described  in  Figure  6-5  below: 

1  (^2 


ai'  '  -  a2''  "  as 


Join  result 

(3   Cost  =  Join  cost  +  update  costs 
Figure  6-5  Cost  of  a  Beta  node 
While  this  is  valid  for  beta  nodes  with  "regular"  children,  the  presence  of  a  sigma 
node  right  above  a  beta  has  an  impact  on  the  direct  application  of  this  cost  formula. 

Fi     F2  F3 
O2  03 


II     I       I      \^  Selection  result 


5  Cost  =  update  costs 
Figure  6-6  Beta  node  with  a  sigma  child 
As  shown  in  Figure  6-6  above,  updates  resulting  from  joins  across  the  children  of 
the  beta  node  are  no  longer  directly  done  to  the  beta  node,  but  go  through  the  filtering 
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effect  of  the  selection  predicate  first.  Consequently,  the  update  cost  for  the  beta  node  is 
computed  from  the  "update  costs  list"  associated  with  the  sigma  below  it. 

6.3   The  Optimizer: 

The  optimizer  was  written  in  C++  and  reached  a  size  of  about  ten  thousand  lines 
of  code.  Approximately  one  third  of  that  code  is  dedicated  to  the  implementation  of  the 
local  change  operators  "Create  Beta,"  "Kill  Beta,"  "Merge  Sibling,"  "Push  Sigma"  and 
"Pull  Sigma,"  each  of  them  written  in  the  tripartite  structure:  try  operator,  undo  operator 
and  do  operator  described  earlier  in  6.1.  In  this  section,  the  reader  will  find  a  short 
description  of  the  major  features  of  this  Gator  network  optimizer. 

6.3.1  Implemented  Randomized  Algorithms: 

The  main  difficulty  in  writing  this  optimizer  lay  in  the  correctness  of  each  of  the 
three  methods  implementing  each  of  the  five  local  change  operators.  In  order  to  keep  a 
better  focus  on  this  task,  it  was  decided  to  implement  only  Iterative  Improvement  as  a 
general  randomized  algorithm  to  be  executed  by  this  optimizer.  The  addition  of  other 
methods  such  as  Simulated  Annealing,  should  however  be  a  straightforward  task  once  the 
local  change  operators  are  available. 

Below  is  an  algorithmic  description  of  our  implementation  of  Iterative 
Improvement. 
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Begin 

double  oldCost,  newCost 

Gator  *BestGator,  *CurrentGator 

int  maxNetworks,  maxFailures,  numNetworks  =  0,  numFailures 
BestGator  =  BuildlnitialGatorNetwork  (RuleGraph) 
While  (numNetworks  <  maxNetworks) 

CurrentGator  =  BuildlnitialGatorNetwork  (RuleGraph) 

numFailures  =  0 

While  (numFailures  <  maxFailures) 
Operator  =  GetOp(random) 
oldCost  =  Cost(CurrentGator) 

CurrentGator  =  TryOperator  (CurrentGator,  Operator) 
newCost  =  Cost(CurrentGator) 
If  (newCost  <  oldCost) 

CurrentGator  =  DoOperator  (CurrentGator,  Operator) 

numFailures  =  0 

Else 

CurrentGator  =  UndoOperator  (CurrentGator,  Operator) 
numFaiIures++ 

Endif 
EndWhile 

If  (Cost(CurrentGator)  <  Cost(BestGator)) 
Delete  (BestGator) 
BestGator  =  CurrentGator 

Else 

Delete  (CurrentGator) 

Endif 

NumNetworks++ 
EndWhile 

End 
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Description  of  the  algorithm  above: 

The  algorithm  generates  a  first  discrimination  network  that  is  considered  to  be  the 
best  one  so  far  and  enters  the  double  while  loop.  A  new  network  is  then  generated  (from 
the  rule  graph)  at  each  step  of  the  outer  while  loop,  and  modified  with  local  change 
operators  within  the  inner  loop.  At  each  step  of  the  inner  loop,  one  of  the  operators  is 
randomly  selected  and  applied  to  the  network  with  the  TryOperator  method.  If  the  local 
change  improves  (reduces  the  cost)  of  the  network,  then  the  operation  is  validated, 
otherwise  it  is  undone.  The  exit  criterion  from  the  inner  loop  is  based  on  a  maximum 
number  of  successive  failures  while  trying  to  improve  the  network  by  applying  local 
operators.  Each  time  the  cost  of  the  network  is  reduced,  the  number  of  successive  failures 
is  reset  to  zero.  When  the  inner  counter  reaches  the  upper  bound  "maxFailures,"  it  is 
assumed  that  the  optimization  process  is  stuck  in  a  local  minimum  and  it  is  time  to 
generate  a  new  network.  The  current  network  is  then  compared  with  the  current  overall 
best  (BestGator)  and  the  one  with  the  lowest  cost  is  kept.  This  process  is  repeated  until 
the  maximum  number  of  networks  has  been  generated  in  the  outer  loop. 
Note: 

Both  parameters  maxNetworks  and  maxFailures  are  dynamically  generated  based 
on  the  number  of  tuple  variables  in  the  rule  graph,  allowing  by  this  way  longer  searches 
for  larger  rules  (and  networks).  The  factors  connecting  the  number  of  tuple  variables  to 
maxNetworks  and  maxFailures  are  also  parameters  to  the  optimizer.  Minimum  thresholds 
for  maxNetworks  and  maxFailures  are  also  set  in  order  to  guarantee  a  "reasonable" 
amount  of  searching  in  all  cases. 

Below  is  a  quick  overview  of  the  most  important  parameters  of  the  optimizer. 
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6.3.2  The  Optimizer  Characteristics: 

Unless  specified  otherwise  for  a  particular  test,  the  characteristics  of  the  optimizer 
will  be  tuned  as  follows: 
Sizes  of  relations: 

The  cardinalities  of  the  relations  will  range  from  a  minimum  of  1000  to  a 
maximum  of  100,000  tuples.  The  distribution  of  the  cardinalities  uses  an  80/20  rule  with 
a  majority  of  small  to  medium  relations  (80%)  and  a  minority  of  large  ones  (20%). 
Selectivity  of  joins: 

The  selectivity  of  joins  will  be  tuned  so  that  the  result  of  a  join  is  never  larger  than 
the  largest  of  the  base  nodes  participating  in  it. 
lOWeight  and  CPUWeight: 

Secondary  storage  access  time  has  not  improved  dramatically  during  the  past 
decade.  On  the  other  hand,  microelectronics  and  VLSI  technology  improvements  have 
caused  central  processing  units  to  almost  double  in  power  every  couple  of  years.  The 
weight  of  CPU  operations  is  therefore  becoming  less  and  less  important  with  respect  to 
that  of  a  disk  access.  The  ratio  of  lOWeight  to  CPUWeight  has  been  set  to  2000  to 
simulate  tests  on  Pentium  Pro  workstations  running  at  clock  speeds  of  200Mhz. 
Size  of  disk  pages: 

The  size  of  disk  pages  will  be  set  to  8  Kilobytes. 
Size  of  tuples: 

The  maximum  size  of  tuples  has  been  set  to  100  bytes. 
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6.4   Experimental  Results: 

In  this  section  we  will  present  some  experimental  results  which  were  obtained 
after  implementing  chosen  variants  of  the  strategies  described  in  chapter  5:  Two-Phase 
Strategy  (noted  TPS)  and  One-Phase  Strategy  with  No-Merge  No-Kill  (noted  OPS).  All 
the  tests  were  run  on  an  8-way  248  Mhz  SUNW,  UltraSPARC-II  CPU  Sun  Sparc 
machine  (model  Ultra  Enterprise  4000/5000)  with  2  GB  of  memory  and  a  reduced 
workload,  allowing  our  process  to  dispose  of  the  full  computing  power  of  one  processor 
per  run.  The  running  times  are  therefore  for  one  CPU,  and  since  none  of  the  tests  was  run 
with  the  machine  in  single  user  mode,  most  results  may  not  be  "exactly"  reproducible. 

6.4.1  Two  Phase  Strategy  Running  Times: 

From  the  complexity  analysis  of  section  5.2.2,  it  was  expected  that  the  second 
phase  of  a  two-phase  strategy  should  not  have  any  significant  impact  on  the  state  space  of 
the  optimizer.  Consequently,  the  addition  of  the  second  phase  in  which  selection 
predicates  are  relocated  in  the  network  should  not  have  a  significant  impact  on  the 
optimizer's  running  times. 
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Figure  6-7  Comparing  running  times  of  Phases  I  and  n  of  TPS 
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In  the  graph  from  Figure  6-7  above,  we  have  plotted  the  running  times  (in 
seconds)  of  phases  I  and  II  of  the  optimizer  for  a  number  of  tuple  variables  ranging  from 
2  to  15.  While  the  total  searching  time  (summation  of  running  times  of  both  phases) 
sensibly  increases  with  the  number  of  tuple  variables,  notice  that  the  impact  of  the 
selection  predicate  placement  phase  remains  negligible  with  respect  to  the  running  time 
of  the  first  phase.  The  figure  below  represents  the  ratios  of  the  running  times  of  both 
phases,  for  the  same  range  of  number  of  tuple  variables. 


Ratios  of  running  times  PII/PI 


-  2.00% 
w 

2  1.50% 
a. 

o  1.00% 
c 

■B  0.50% 
0.00% 


nnnfifinnnnn 

■  phase  II 


1    2    3   4   5    6    7    8    9  10  11  12  13  14  15 
Number  of  tuple  variables 


Observe  that  the  running  time  of  the  second  phase  remains  a  small  fraction  (under 
2%  on  average)  of  the  running  time  of  the  first  phase.  Also,  as  the  number  of  tuple 
variables  increases  (above  four  or  five),  the  ratio  decreases  sensibly  to  values  not  larger 
than  0.5%  on  average.  This  is  due  to  the  fast  increase  of  running  times  of  Phase  I  for 
networks  with  sizes  larger  than  four  tuple  variables  (see  Figure  6-7). 


6.4.2  Testing  Running  Times  for  Both  Strategies: 

In  this  paragraph  we  will  perform  a  quantitative  analysis  of  the  speed  of  each  of 
the  implemented  strategies,  with  respect  to  different  settings  of  the  selectivity  factors  of 
joins  and  selection  predicates.  The  next  section  will  focus  on  the  quality  of  the  results 
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(costs  of  networks).  Throughout  this  section  (6.4.2),  the  vertical  bars  with  a  percentage 
vertical  axis  will  represent  the  difference  (in  percent)  between  the  running  times  of  OPS 
and  TPS:  100  *  (time(OPS)  -  time(TPS))  /  time(OPS).  Below  is  a  series  of  three  tests 
focusing  on  the  relative  running  times  of  the  OPS  with  respect  to  the  TPS  approach  for 
different  configurations. 


Test  with  selective  joins  and  non-selective  sigmas: 

The  initial  set  of  joins  and  sigma  selectivity  factors  was  defined  at  opposite 
extremes  in  order  to  allow  a  better  analysis  of  changing  of  each  of  them  on  the  global 
running  times  of  each  strategy.  In  the  first  graph  below  (Figure  6-8),  joins  were  selective 
(the  result  of  a  join  is  smaller  than  the  base  nodes  participating  in  the  join),  but  selection 
predicates  were  non-selective  (majority  of  sigmas  with  selectivity  larger  than  0.9). 
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Figure  6-8  Relative  running  times  with  selective  joins  and  non-selective  sigmas 
From  Figure  6-8  above,  it  can  be  noticed  that  the  two-phase  strategy  is 
systematically  faster  than  the  one-phase  approach.  The  difference  in  running  times 
increases  from  20%  for  small  networks  to  approximately  60%  for  larger  ones  (mean  at 
42%),  indicating  running  times  for  OPS  about  twice  the  length  those  of  TPS.  The 
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absolute  values  for  the  running  times  ranged  from  0.02  to  52  seconds  for  OPS  and  from 
0.02  to  28  seconds  for  TPS.  In  order  to  examine  the  impact  of  the  selectivity  of  joins,  the 
next  test  will  be  run  with  the  selectivities  of  joins  set  higher  (less  selective  joins)  and  the 
same  selection  predicates. 

Test  with  non-selective  joins  and  non-selective  sigmas: 

In  the  graph  below,  the  same  tests  were  run  with  the  selectivity  factors  of  the  joins 
set  to  values  about  5  times  larger  than  those  from  Figure  6-8. 
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Figure  6-9  Relative  running  times  with  less  selective  joins,  non-selective  sigmas 
While  one  phase  optimization  is  still  slower,  a  clear  reduction  in  the  differences 
between  running  times  can  be  noticed  immediately  (average  of  the  difference  is  now  at 
23%).  This  reduction  is  due  to  both  a  decrease  in  the  running  times  of  OPS  as  well  as  an 
increase  in  the  running  times  of  TPS  as  shown  in  Figure  6-10  below: 
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Variations  in  running  times  of  TPS 
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Figure  6-10  Variations  in  running  times  of  OPS  and  TPS  with  less  selective  joins 
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In  order  to  examine  the  effect  of  the  selectivity  of  selection  predicates  on  each 
strategy,  the  following  test  was  run  with  the  same  non-selective  joins  (Figure  6-9),  and 
the  selection  predicates  made  more  selective  (a  majority  of  the  selection  predicates  will 
have  their  selectivity  set  under  0. 1 ). 

Test  with  non-selective  joins  and  selective  sigmas: 

From  Figure  6-11  below,  it  can  be  noticed  that  the  gap  between  the  running  times 
of  one  and  two  phase  strategies  is  again  widened  in  favor  of  TPS  when  the  selection 
predicates  are  made  more  selective.  The  average  of  the  difference  between  the  running 
times  climbs  to  approximately  60%,  giving  TPS  optimization  times  about  three  times 
shorter  than  OPS  on  average. 
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Figure  6-11  Relative  running  times  with  large  joins  and  selective  sigmas 

This  increase  in  the  difference  of  running  times  between  both  strategies  is  caused 
by  a  moderate  increase  in  the  running  times  of  OPS  combined  with  a  significant  decrease 
in  the  running  times  of  TPS.  This  variations  (illustrated  in  Figure  6-12  below)  were 
expected  since  the  effect  of  increasing  the  selectivity  of  selection  predicates  is  identical  to 
that  of  increasing  the  selectivity  of  joins  from  a  structural  point  of  view:  both  lead  to 
higher  and  narrower  networks. 
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Variations  in  running  times  of  OPS 
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Variations  in  running  times  of  TPS 
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Figure  6-12  Variations  in  the  running  times  of  OPS  and  TPS  with  selective  sigmas 
6.4.3  Analysis: 

Before  looking  into  the  analysis  of  the  results  of  the  various  running  times  of  OPS 
and  TPS,  it  is  necessary  to  make  a  few  preliminary  remarks  and  define  the  meaning  of  a 
term,  which  will  be  repeatedly  reused  throughout  the  analysis. 


Definition  of  a  No-Op: 

We  will  define  a  "No-Op"  as  a  call  to  a  local  change  operator  that  cannot  be 
executed  because  the  node  on  which  it  is  being  tried  does  not  qualify  to  perform  it.  In 
such  a  case,  the  optimizer  increments  the  inner-loop  counter  and  randomly  generates  a 
call  to  a  new  operator.  Examples  of  a  No-Op  are: 

-  A  call  to  "Create  Beta"  on  a  node  that  has  only  two  children 

-  A  call  to  "Merge  Sibling"  with  a  beta  node  having  only  two  children. 

-  A  call  to  "Merge  Sibling"  with  a  beta  node  that  has  only  alpha  children. 

The  running  time  of  a  No-Op  will  be  essentially  zero  comparatively  with  a 
complete  sequence  consisting  of  trying  an  operator,  followed  by  either  a  validation  or  a 
rollback.  The  proportion  of  No-Ops  in  a  test  run  will  therefore  have  a  significant  impact 
on  the  overall  running  time  of  the  optimization  process. 
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Preliminary  remarks  on  the  effects  of  selectivity  factors: 

The  increase  in  the  selectivity  factors  of  joins  (joins  less  selective)  produces  larger 
internal  nodes,  which  makes  them  more  expensive  to  maintain.  Consequently,  the 
optimizer  generates  "flatter"  networks  (internal  nodes  have  more  children)  when  joins 
become  less  selective.  Inversely,  when  joins  are  selective,  the  shape  of  the  network  tends 
towards  that  of  a  binary  tree  where  each  internal  node  has  two  children.  The  same  remark 
can  be  made  on  the  selectivity  of  selection  predicates:  the  less  selective  they  are,  the 
larger  the  internal  nodes  will  be  and  the  flatter  the  networks  will  become,  and  vice  versa. 

The  immediate  assertion  that  can  be  made  from  the  outcome  of  the  previous  three 
tests  is  that  the  two-phase  strategy  is  sensibly  faster  than  the  one-phase  strategy  in  all 
cases.  The  reason  behind  this  difference  of  optimization  times  is  predominantly  due  to 
two  factors: 

-  The  difference  of  complexities  between  the  two  approaches  as  they  were  described  in 
chapter  5  (sections  5.2.2  and  5.3.1).  Indeed,  the  larger  state  space  of  one-phase 
strategies  causes  the  inner  loop  of  Iterative  Improvement  (see  section  6.3.1)  to  restart 
more  often  than  in  a  two-phase  strategy  approach  (each  time  the  network  is 
improved,  the  inner  counter  is  reset  to  0).  As  a  result,  each  newly  generated  network 
from  the  outer  loop  requires  a  larger  number  of  iterations  to  settle  in  a  minimum 
state. 
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-  The  operators  "Push  Sigma"  and  "Pull  Sigma"  are  substantially  more  complicated 
than  the  three  original  ones  both  in  terms  of  lines  of  code  executed  as  well  as  in  terms 
of  impact  on  the  network.  Indeed,  while  creating  or  killing  a  beta  node  only  affects  a 
network  locally,  the  action  of  relocating  a  selection  predicate  can  potentially  have  a 
repercussion  on  a  much  larger  number  of  nodes  as  illustrated  in  the  example  of 
Figure  6-13  below: 


P 


Figure  6-13  Impacts  of  "Kill  Beta"  and  "Push  Sigma"  operators  on  a  network 
Impact  of  killing  node  p^: 

When  node  pi  is  killed,  tti  and  a2  become  children  of  ^2,  but  the  cardinality  of 
this  latter  is  not  affected.  The  changes  in  the  network  are  therefore  "contained"  to 
restructuring  the  children  of  P2  and  re-computing  its  cost.  The  same  remarks  are  valid  for 
"Create  Beta"  and  "Merge  Sibling." 
Impact  of  pushing     above  the  Pnode  P: 

If  ai  is  pushed  down  the  network  and  placed  just  above  P,  the  cardinalities  of 
nodes  ai.  Pi  and  P2  are  changed  (increased).  Consequently,  the  cost  of  each  of  these 
nodes  needs  to  be  re-computed.  The  change  in  the  cardinality  of  P2  changes  not  only  the 
cost  of  the  node  itself,  but  can  potentially  affect  the  join  plan  (order  in  which  the  joins  are 
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done)  of  each  of  its  siblings  p3  and  Of,.  This  is  true  whenever  a  node,  for  which  the  join 
plan  contains  more  than  two  nodes,  has  its  cardinality  changed.  As  a  result,  the  relocation 
of  Oi  above  P  has  a  ripple  effect  causing  a  more  significant  number  of  nodes  to  be 
recalculated  and  is  therefore  a  lengthier  operation.  (The  same  remark  also  applies  to  the 
"Pull  Sigma"  operator). 

Since  one-phase  strategies  are  likely  to  execute  the  "Push  Sigma"  and  "Pull 
Sigma"  operators  more  often  than  a  two-phase  approach,  this  factor  also  participates  into 
making  OPS  slower  than  TPS. 

Going  back  to  the  results  of  Figure  6-10,  the  increase  in  the  running  time  of  TPS 
(for  larger  join  selectivity  factors)  is  primarily  attributed  to  a  smaller  proportion  of  No- 
Ops  in  the  optimization  process.  When  considering  the  original  three  local  change 
operators,  "Create  Beta"  "Kill  Beta"  and  "Merge  Sibling"  that  are  called  in  the  first  phase 
of  TPS,  "Create  Beta"  and  "Merge  Sibling"  require  internal  nodes  with  at  least  three 
children.  (Note  that  "Merge  Sibling"  even  requires  that  at  least  one  of  those  children  be  a 
beta  node).  Therefore,  when  the  shape  of  the  network  tends  towards  that  of  a  binary  tree 
(with  more  selective  joins),  the  number  of  operations  that  can  be  executed  is  sensibly 
reduced  and  more  calls  to  the  local  change  operators  generate  No-Ops,  leading  to  faster 
optimization  times.  On  the  other  hand,  when  the  internal  nodes  have  more  children,  the 
number  of  No-Ops  reduces,  and  each  operator  actually  executes  (try  operator,  followed  by 
either  validation  or  rolling  back),  leading  to  longer  optimization  times  despite  a  smaller 
number  of  iterations  (see  Note  below).  In  conclusion,  an  increase  in  the  running  times  of 
TPS  is  actually  expected  with  less  selective  joins,  as  obtained  experimentally. 
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Note: 

An  interesting  observation  we  made  while  running  the  tests  is  that  high-and- 
narrow  networks  will  typically  require  a  larger  number  of  iterations  than  bushy  ones.  The 
reason  seems  to  be  the  fact  that  the  initial  random  networks  usually  have  bushy  shapes 
and  therefore  require  more  iterations  to  reach  their  final  configuration,  when  it  is  (close 
to)  a  non-bushy  tree  (network). 

Still  considering  the  results  of  Figure  6-10,  the  decrease  in  the  running  times  of 
OPS  is  explained  by  a  tangible  reduction  in  the  number  of  iterations  (see  Note  above) 
combined  with  an  increase  in  the  proportion  of  No-Ops  when  the  joins  become  less 
selective.  While  the  lessening  in  join  selectivity  (leading  to  flatter  networks)  caused  a 
reduction  of  the  proportion  of  No-Ops  in  TPS  (and  slower  running  times),  it  does  have  an 
opposite  effect  in  this  case  and  makes  OPS  faster.  This  increase  in  the  ratios  of  No-Ops  is 
attributed  to  the  presence  of  sigmas  within  networks  that  have  less  internal  nodes 
(because  they  are  flatter),  causing  higher  proportions  of  "Kill  Betas"  and  "Merge 
Siblings"  to  fail.  (See  sections  5.3.2,  5.3.3  and  5.3.4  about  the  "No-Merge  No-Kill" 
approach). 

In  the  next  test  (Figure  6-12),  the  trends  were  reversed:  OPS  got  moderately 
slower  while  the  running  times  of  TPS  became  significanfly  faster  with  respect  to  those 
measured  from  Figure  6-9.  As  mentioned  earlier,  these  variations  were  expected  since  the 
reduction  of  the  selectivity  factors  of  selection  predicates  has  essentially  the  same  effect 
on  the  morphology  of  the  network  as  the  reduction  in  the  selectivity  of  joins. 
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6.4.4  Testing  Result  Quality  for  Both  Strategies: 

In  this  section,  we  will  put  the  focus  on  the  quality  of  the  results  (costs  of  the 
networks)  for  the  same  sets  of  parameters  as  those  presented  in  section  6.4.2.  The  vertical 
bars  will  represent  here  a  percentage  difference  between  the  costs  of  the  networks 
obtained  by  one  and  two  phase  strategies:  100  *  (cost(OPS)  -  cost(TPS))  /  cost(OPS). 
Note  that  negative  values  of  the  percentages  will  indicate  cases  where  OPS  yields  a  better 
result  (smaller  network  cost)  than  OPS. 
Testing  with  selective  joins  and  large  sigmas: 

As  the  reader  will  notice  from  the  graph  represented  in  Figure  6-14  below,  the 
differences  in  the  quality  of  the  results  are  significantly  smaller  than  those  obtained  for 
the  running  times.  The  absolute  value  of  the  difference  remains  under  10%  and  OPS 
performs  slightly  better  than  TPS  for  4  and  10  tuple  variables.  In  both  cases  the 
improvement  over  TPS  remains  under  2%  and  is  not  considered  as  a  significant  one. 
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Figure  6-14  Relative  network  costs  for  selective  joins  and  non-selective  sigmas 
In  the  following  test,  the  join  selectivity  factors  were  again  multiplied  by  a  factor  of  5  in 
order  to  estimate  the  influence  of  JSFs  over  the  quality  of  the  results. 
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Testing  with  non-selective  joins  and  large  sigmas: 

As  observed  in  Figure  6-15  below,  the  increase  in  the  selectivity  of  joins  does  not 
have  any  significant  impact  on  the  relative  quality  of  the  results.  The  two-phase  approach 
still  yields  better  results  despite  a  moderate  decrease  in  the  differences  with  respect  to 
OPS. 
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Figure  6-15  Relative  network  costs  for  non-selective  joins  and  non-selective  sigmas 
In  the  following  run,  the  joins  were  left  with  the  same  values  as  in  Figure  6-15, 

and  selection  predicates  were  changed  so  that  a  majority  would  have  their  selectivity 

factors  below  0. 1 . 

Testing  with  non-selective  joins  and  selective  sigmas: 
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Figure  6- 1 6  Relative  network  costs  for  non-selective  joins  and  selective  sigmas 
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As  can  be  seen  in  the  graph  above,  the  reduction  in  the  selectivity  factors  of 
sigmas  did  have  a  significant  impact  on  the  resulting  networks.  Indeed,  with  rather 
selective  selection  predicates,  the  one-phase  approach  does  systematically  return  better 
results  than  the  two-phase.  The  improvement  ranges  from  a  minimum  of  about  2%  to  a 
maximum  surpassing  20%,  with  an  average  improvement  of  10%  approximately. 

6.4.5  Analysis: 

From  the  results  of  the  set  of  tests  ran  in  section  6.4.4,  it  shows  that  the  selectivity 
of  selection  predicates  is  the  most  significant  parameter  influencing  the  quality  of  the 
results  obtained. by  the  implemented  strategies.  A  majority  of  selection  predicates  with  a 
large  selectivity  factor  gives  the  advantage  to  a  two-phase  based  approach  while  a  greater 
proportion  of  selective  predicates  changes  the  trend  in  a  tangible  way.  From  the  analysis 
of  the  shapes  of  the  generated  networks,  we  have  been  able  to  reach  the  conclusion  that 
the  difference  in  the  quality  of  results  is  due  the  fact  that  the  larger  state  space  factor 
works  in  favor  of  OPS  in  the  case  of  selective  selection  predicates.  Indeed,  with  a 
majority  of  selective  selection  predicates  placed  at  the  top  of  the  network,  a  two-phase 
approach  almost  systematically  yields  a  final  network  (before  the  second  phase  is  applied) 
with  a  binary  shape  since  all  internal  nodes  are  fairly  small.  The  second  phase  in  which 
selection  predicates  are  pushed  down  does  not  change  the  binary  property  of  the  network. 
On  the  other  hand,  the  ability  of  a  one-phase  approach  to  relocate  selection  predicates 
anytime  during  the  whole  optimization  process  often  reaches  configurations  where  one  or 
more  internal  nodes  have  more  than  two  children.  More  generally,  the  positioning  of 
selective  selection  predicates  is  more  "sensitive"  than  that  of  non-selective  ones  (i.e.  has  a 
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more  significant  impact  on  the  overall  cost  and  shape  of  the  network).  Therefore,  the 
larger  state  space  and  number  of  "tries"  to  relocate  sigmas  in  OPS  is  beneficial  when  they 
are  selective  and  less  beneficial  when  they  are  not. 

6.4.6  Conclusion: 

In  the  tables  below,  the  reader  will  find  a  summary  and  recapitulation  of  the 
experimental  results  that  were  obtained  in  the  previous  sections.  Table  3  gives  a 
comparison  of  the  running  times  of  one  and  two-phase  strategies  as  the  selectivities  of 
joins  and  selection  predicates  become  larger  (as  the  predicates  become  less  selective). 
Table  4  focuses  on  the  quality  aspect  of  the  results  obtained  by  each  of  the  strategies.  In 
both  tables,  the  horizontal  dimension  represents  the  selectivity  factors  of  the  predicates 
(growing  from  left  to  right). 


Table  3  Comparing  running  times  of  OPS  and  TPS 


Sigma 
Selectivity 

Selective 

Sigma  selectivity  factors  increase  from  left  to  right  on  this  axis 

Not  selective 

Joins 
Selectivity 

Selective 

Join  selectivity  factors  increase  from  left  to  right  on  this  axis 

Not  selective 

Network 
Shape 

\J                 Networks  become  flatter  and  wider  from  left  to  right  on  this  axis 

OPS 
Running 
Time 

TPS  much 
faster  than 
OPS 

Slow 
(High  running 

~       ^.^^  Becomes 
faster 

TPS  faster 

TPS 
Running 
Time 

^ — *^  Becomes 

^^^^                                  -^^^  slower 
(Low  running          ^  ' 

times)  ■^■"""^ 

than  OPS 

As  the  selectivity  factors  of  sigmas  and  joins  become  larger,  the  shapes  of  the 
networks  change  from  high  and  narrow  to  flatter  and  wider  ones  (the  reason  is  that 
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internal  nodes  become  larger  and  more  expensive  to  maintain,  which  results  in  a  smaller 
number  of  them).  On  the  left  side  of  the  table  above,  OPS  running  times  are  high  and  get 
lower  as  we  move  further  to  the  right.  The  trend  is  reversed  for  TPS.  Consequently,  TPS 
is  significantly  faster  (up  to  3  or  4  times)  than  OPS  for  selective  predicates,  and  the 
difference  in  the  running  times  drops  towards  more  moderate  values  (1.5  or  2)  as  the 
predicates  become  less  selective.  Note  that  TPS  remains  faster  than  OPS  in  all  cases. 
However,  it  is  important  to  point  out  at  the  fact  that  the  absolute  values  of  the  running 
times  are  fairly  small,  and  considering  the  relative  values  only  might  be  misleading  when 
choosing  a  strategy  to  run.  Indeed,  even  with  a  running  time  up  to  3  times  slower  than 
TPS,  OPS  still  optimizes  a  network  in  under  a  minute  for  large  networks  (14  or  15  tuple 
variables)  and  under  15  seconds  for  regular  ones  (5  to  10  tuple  variables). 


Table  4  below  recapitulates  the  result  qualities  of  both  strategies. 

Table  4  Comparing  networks  costs  of  OPS  and  TPS 


Sigma 
Selectivity 

Selective 

Sigma  selectivity  factors  increase  from  left  to  right  on  this  axis 

Not  Selective 

OPS 
Relative 
Networks 

Costs 

OPS  returns 
better  results 
than  TPS 

OPS 

networks  ^ — """'^ 
costs  are  low  --^ 

^ — ^       OPS  networks 

^                          costs  become 

higher  relatively 
to  TPS 

TPS  returns 
better  results 
than  OPS 

TPS 

Relative 
Networks 
Costs 

(networks 
with  smaller 
costs) 

TPS  networks  ^^^^ 
costs  are  high  — 

TPS  networks 
costs  become 
lower  relatively  to 
OPS 

When  selection  predicates  are  not  selective  (right  side  of  the  table),  the  two-phase 
strategy  returns  better  results  on  average  than  a  one-phase  approach.  However,  as  they 
become  more  selective  on  the  left  side  of  the  table,  the  tendency  is  reversed  and  OPS 
becomes  more  effective.  The  reason  for  this  shifting  in  the  effectiveness  of  each  strategy 
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is  explained  by  the  relationship  between  the  selectivity  of  sigmas,  and  the  optimization 
process.  Non-selective  selection  predicates  have  little  bearing  on  the  overall  result  of  the 
optimization  process,  since  their  filtering  capacity  is  low.  The  placement  of  non-selective 
sigmas  in  the  network  can  be  considered  as  an  orthogonal  problem  to  that  of  optimizing 
the  network  itself,  and  trying  to  mix  both  in  one  single  phase  causes  a  larger  state  space 
without  any  significant  benefit.  It  is  better  to  treat  the  optimization  of  the  network  and  the 
placement  of  the  sigmas  separately,  in  two  phases.  On  the  other  hand,  when  selection 
predicates  become  (very)  selective,  their  influence  on  the  shape  of  the  network  is  more 
significant,  and  their  placement  critical.  The  problem  of  placing  selection  predicates  is  no 
more  (or  less)  orthogonal  to  that  of  optimizing  the  network,  and  running  the  optimization 
in  one  phase  becomes  beneficial  and  the  larger  state  space  advantageous. 

TPS  or  OPS? 

Given  the  global  results  presented  so  far,  our  recommendations  for  the 
implementation  of  a  complete  Gator  networks  optimizer  with  selection  predicate 
placement  capabilities  are  the  following.  If  only  one  strategy  is  to  be  implemented,  then  a 
two-phase  approach  is  definitely  recommended  because  of  the  subsequent  reasons: 
-    The  running  time  of  the  second  phase  in  which  sigmas  are  placed  is  completely 
negligible  with  respect  to  the  running  time  of  the  first  phase.  Therefore,  the  addition 
of  the  second  phase  does  not  increase  the  running  time  of  the  optimizer  and  "will  not 
harm,  if  it  does  not  help." 
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-  The  results  obtained  by  the  two-phase  strategy  in  terms  of  the  quality  of  the  returned 
networks  are  always  acceptable.  Even  when  OPS  returns  better  results  than  TPS,  the 
differences  in  network  costs  are  never  different  by  an  order  of  magnitude. 

If  one  is  willing  to  go  to  the  extra  complexity  of  implementing  both  methods,  then  the 

optimizer  can  be  tuned  to  select  one  strategy  or  the  other  based  on  the  relevant 

parameters. 

Or  both? 

As  we  mentioned  earlier,  even  the  slowest  running  times  are  still  under  a  minute, 
for  networks  of  14  or  15  tuple  variables.  The  optimization  times  drop  to  about  15  seconds 
for  networks  of  sizes  ranging  between  5  and  10  tuple  variables.  Moreover,  these  running 
times  will  even  become  smaller  with  state  of  the  art  processors  running  at  clock  speeds  of 
300  to  400  Mhz.  Therefore,  a  very  conceivable  approach  could  be  to  run  both  strategies 
and  pick  the  best  result. 


CHAPTER  7 
CONCLUSION  AND  FUTURE  WORK 

Although  the  support  of  complex  triggering  capabilities  is  still  limited  in 
commercial  database  systems  (see  chapter  1),  it  is  our  belief  that  the  design  and 
implementation  of  TriggerMan  as  an  Asynchronous  Rule  Processor,  and  its  full 
compatibility  with  the  Object-Relational  model,  are  an  important  step  towards  achieving 
this  goal.  In  this  dissertation,  we  have  presented  the  major  issues,  which  one  is  likely  to 
come  across  while  implementing  and  integrating  selection  predicate  placement 
capabilities,  within  a  randomized  discrimination  network  optimizer.  After  briefly 
describing  the  Gator  network  cost  model  (chapter  3),  the  inadequacy  of  rank-based  query 
processing  techniques  was  shown  (chapter  4),  and  a  set  of  alternative  strategies  (of 
increasing  levels  of  complexity)  was  proposed  (chapter  5).  Next,  an  implementation 
phase  was  carried  out  (chapter  6),  during  which  the  feasibility  and  relative  efficiencies  of 
the  proposed  strategies  were  tested  and  evaluated.  In  the  course  of  the  coding  phase,  a 
new  and  more  efficient  approach  was  suggested  for  the  implementation  of  the  local 
change  operators,  using  a  tripartite  structure:  TryOperator,  DoOperator  and  UndoOperator 
for  each  of  them.  Finally,  and  based  on  the  experimental  results  obtained  in  chapter  6,  a 
set  of  suggestions  was  made  for  integrating  the  selection  predicate  placement  techniques 
within  the  TriggerMan  code. 
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While  answering  many  questions  on  the  issue  of  selection  predicate  placement  in 
Gator  and  discrimination  networks,  this  work  still  leaves  open  research  questions  and 
issues  to  be  investigated  in  future  research  endeavors: 

A  general  optimization  technique  widely  used  to  reduce  the  cost  of  all  types  of 
expensive  operations  is  caching.  While  this  issue  was  left  aside  in  this  work,  the 
possibility  of  caching  the  results  of  a  selection  predicate  for  a  particular  token  is  a 
viable  optimization  add-on,  to  consider  improving  token  propagation  times  in 
discrimination  networks.  A  simple  approach  would  be  to  implement  sigma  nodes  as 
lookup  tables  instead  of  pure  selection  predicates.  Each  time  a  compound  token 
containing  the  attribute  on  which  the  selection  is  defined,  is  propagated  through  a 
sigma  (cyattr=vaiue( token)  =  true  or  false),  the  result  of  applying  the  sigma  to  that  token, 
or  value  of  the  attribute  can  be  cached  (function  caching).  Subsequent  (potentially 
expensive)  computations  for  tokens  with  the  same  value  of  the  attribute  can  be 
looked  up  instead  of  recomputed.  The  integration  of  caching  in  the  optimization 
process  would  require  the  development  of  a  corresponding  cost  formula  or  model, 
which  is  an  orthogonal  problem  to  ours.  Once  a  particular  cost  model  or  formula  has 
been  developed,  its  integration  with  our  optimizer  becomes  straightforward  since  it 
does  not  require  anything  more  than  plugging  the  new  formulas  in  the  proper 
methods.  (This  would  be  very  similar  to  extending  our  optimizer  with  the  cases  that 
were  left  aside  such  as  having  indexes  on  join  attributes  etc.) 
-  As  mentioned  in  chapter  1,  many  implementations  of  materialized  views  and  their 
maintenance  reuse  active  capabilities  and  active  rules.  A  simplified  (but  accurate) 
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description  of  the  relationship  between  a  materiahzed  view  and  a  Gator  network  is 
given  by  the  following  correspondence: 

If  tokens  inserted  in  the  Pnode  of  a  Gator  network  are  never  cleared  out 
systematically,  and  kept  in  the  Pnode  until  they  are  erased  by  a  corresponding  A- 
(read  delta  minus)  token  [Han92],  then  the  Pnode  becomes  a  materialized  view  of  the 
database,  matching  the  condition  defined  by  the  Gator  network. 
Consequently,  the  issue  of  optimizing  materialized  views  with  aggregates,  becomes 
very  related  to  our  problem,  since  aggregates  can  be  considered  "similar"  to  selection 
predicates,  in  the  sense  that  they  can  potentially  be  relocated  at  multiple  places  in  the 
materialized  view.  Given  the  heavy  use  of  materialized  views  in  important  areas  of 
data  management  such  as  data  warehousing,  it  is  our  belief  that  the  optimization  of 
materialized  views  with  aggregates  could  benefit  from,  or  reuse  and  extend,  some  of 
the  insight  and  ideas  exposed  in  this  work. 

As  a  closing  remark  to  this  work,  we  would  like  to  emphasize  the  investigative 
nature  of  this  dissertation  as  a  first  attempt  to  explore  the  various  issues  and  questions 
related  to  selection  predicate  placement,  in  Gator  and  discrimination  networks  in  general. 
A  two-phase  strategy  emerged  as  the  best  approach  for  a  first  implementation  of  the 
optimizer  to  be  written  for  TriggerMan.  However,  the  introduction  of  a  one-phase 
approach  is  not  excluded  for  future  versions,  as  the  project  gains  maturity  and  efficiency 
issues  take  the  stand  over  major  end-to-end  functionality  implementation  tasks. 


GLOSSARY 


ADBMS 

Active  Database  Management  System. 

ADT 

Abstract  Data  Type. 

AI 

Artificial  Intelligence. 

ARP 

Asynchronous  Rule  Processor.  (Synonymous  with  ATP) 

ATP 

Asynchronous  Trigger  Processor. 

DBMS 

Database  Management  System. 

ESF 

Edge  Selectivity  Factor. 

GIS 

Geographic  Information  System. 

IDS/UDO 

Informix  Dynamic  Server  with  Universal  Data  Option. 

JSF 

Join  Selectivity  Factor. 

00 

Object  Oriented. 

OPS 

One-Phase  Strategy. 

TPS 

Two-Phase  Strategy. 
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APPENDIX  A 


ANIMAL 

LOCATION 

Gator 

(1,1) 

Fros 

(3,1) 

Gator 

(3,4) 

Gator 

(5,5) 

Crow 

(1,1) 

Gator 

(8.4) 

Raccoon 

(4,3) 

Gator 

(3,0) 

Gator 

(1,4) 

Gator 

(4,1) 

Frog 

(7,3) 

Gator 

(3,3) 

Frog 

(1,1) 

Gator 

(1,4) 

Gator 

(3,2) 

Rabbit 

(4,0) 

Gator 

(3,4) 

Raccoon 

(5,0) 

Mouse 

(5,1) 

Gator 

(5,5) 

Table  Fauna_Sightings 


User 

Box 

Thorfan 

(0,0,1,1) 

Thorbait 

(5,5,30,30) 

Etc... 

Table  InterestArea 
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APPENDIX  B 

The  analysis  of  the  inadequacy  of  the  rank  metric  for  discrimination  networks  that 
was  presented  in  chapter  3,  rehed  on  an  example  network  where  a  beta  node  had  only  two 
children.  The  reader  will  find  below  a  generalization  of  that  analysis  to  cases  where  beta 
nodes  can  have  more  than  two  children.  For  ease  of  reading,  this  example  is  restricted  to 
three  children  only.  However,  a  simple  change  in  notations  is  required  to  represent  cases 
with  N  children. 

Consider  the  network  (or  sub  network)  below  where  the  beta  node  has  three  children  tti, 
tti  and  a3. 


Cost:  Ca2 


Fi,  F2  and  F3  are  the  frequencies  of  tokens  arriving  respectively  into  the  left,  middle  and 
right  input  edges  of  the  network.  The  selectivities  of  selection  predicates  ai,  G2  and  are 
denoted  So,,  802  and  So,.  The  selectiviy  factors  of  the  joins  are  represented  by  JSFi  and 
JSF2.  The  "directional"  selectivity  factors  are  still  defined  as  they  were  in  chapter  3: 
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Sj,:  =  JSF,  *  Card(a2). 
Sj,,  =  JSFi  *  Card(ai). 
Sj23  =  JSFi  *  Card(a3). 
Sj32  =  JSFi  *  Card(a2). 

The  semantics  of  the  directional  selectivity  factor  are  as  follows: 

Each  token  inserted  into  tti  produces  on  average  Sj,2  tokens  after  joining  with  a2. 
Likewise,  each  token  inserted  into  produces  on  average  Sjzi  and  Sjij  tokens  when  joining 
with  tti  and  tts  respectively.  In  a  similar  fashion,  a  token  inserted  into  as  produces  on 
average  Sj32  tokens  when  joining  with  a2.  Tokens  inserted  into  ai  will  therefore  generate 
an  average  total  of  Sjn  *  Sj23  tokens  after  joining  with  both  and  as.  Consequently,  the 
ESF  of  edge  (tti^P)  denoted  Sji  will  be  the  result  of  the  product  of  Sj^and  Sjas : 

Sjl  =  Sji2  *  Sj23 

and  likewise: 

Sj2  =  Sj2i  *  Sj23  and  Sjs  =  Sj32  *  Sj2i 
The  reader  will  notice  that  when  a  beta  node  has  more  than  two  children,  the  ESFs  are  no 
longer  equal  to  the  "directional"  selectivity  factors,  histead,  the  ESF  of  an  edge 
(child-^(3)  is  computed  as  the  product  of  all  the  "directional"  selectivity  factors 
encountered  in  the  join  plan  of  the  child  node  in  question. 

Finally,  the  selection  predicate  Cj  has  been  assigned  a  cost  of  C02,  and  the  join  conditions 
between  ai  and  a2  and  a2  and  have  been  assigned  differential  cost  values  of  Cj,2,  Cj2i, 
Cj23  and  Cj32. 
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Let  us  consider  now  the  operation  of  pushing  down  the  selection  predicate  CSj  as 
described  in  the  figure  below. 


C2  S(T2 

p 

With  the  assumption  that  predicates  have  independant  selectivities  still  holding,  the  JSF 
values  of  each  join  remain  unchanged  (as  well  as  the  selectivity  of  the  selection  predicate 
02).  Consequently,  Sj^  and  Sjn  change  to  Sj^'  and  Sj32'.  The  changes  occuring  to  Sj^  and 
Sj32  cause  the  values  Sj,  and  Sjj  to  change  to  Sj,'  and  Sjj'  respectively.  Sj2  and  the 
selectivity  of  csj  remain  unchanged. 

Recalling  the  definition  of  the  rank  metric,  the  rank  of  G2  is:  rank(a2)  =  ^"'"^ 

C<J2 

Let  Cj/a2  represent  the  cost  of  the  join  with  respect  to  the  base  node  a2.  The  rank  of  the 
join  with  respect  to  aj  is  then  defined  as:   rank(J/a2)  =  .^iinl 

Cj/or2 

As  discussed  in  chapter  4,  the  rank  based  heuristic  orders  the  predicates  in  ascending 
order  of  the  rank  metric,  and  will  therefore  decide  to  perform  a  PUSHDOWN  when  the  rank 
of  the  join  is  smaller  than  the  rank  of  the  selection  predicate  G2:  rank(J/a2)  <  rank(a2) 
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Replacing  rank(J/a2)  and  rank(a2)  with  their  respective  values  defined  above,  we  can 

3  ,  _  1     So2  —  1 

rewrite:  rank(J/  ai)  <  rank(  02)  <=>  -j^  <  — ^  

(-^02  Ca2 

Then  by  multiplying  each  side  of  the  equation  above  by  Q/aa  and  dividing  by  (S02-  1),  we 

get  the  following  inequality  — ^  <  —        (reversing  the  comparison  since  S02  - 1  <  0) 

C02     S02  —  1 

1  -  S  2 

Which  can  be  rewritten  as  I  Uj/a2  <  *  Ua2  (4) 

I-S02 

after  each  side  of  the  inequality  (4)  above  is  multiplied  by  the  positive  value  Ca2.  The  rank 
based  heuristic  will  therefore  perform  a  pushdown  when  Q/ai  is  strictly  smaller  than  the 
right  side  of  (4)  and  not  perform  the  pushdown  otherwise. 

Reusing  the  notation  from  chapter  4,  let  AC"^  and  AC"  represent  the  variations  in  the  cost 
of  the  network  when  the  selection  predicate  G2  is  pushed  down. 

AC^:  F2  *  (Cj2,  +  Sin  *  Cj23  +  la^)  -  F2  *  Saa  *  (Cj,  +  *  Sja,  *  Cj,  +  la^  +  Fi  *  Sa,  *  AQn 

+  F3  *  Sa3  +  ACj32 

AC^:  F2  *  (1  -  SaO  *  (Cj2,  +  Sj2,  *  Cj23  +  laa)  +  F,  *  Sa,  *  AQn  +  F3  *  S03  *  AQ„ 

The  (potential)  decrease  in  the  execution  of  the  selection  predicate  condition  is  given  by: 

AC":  F2  *  Ca2  *  -  [F,  *  Sa,  *  Sjr  +  F2  *  Si2  +  F3  *  Sa,  *  Sj3']  *  Ca2 

ACT:  {F2  -  [Fi  *  Sa,  *  Sj,'  +  F2  *  Sj^  +  F3  *  Sa,  *  Sj3']}  *  Ca, 

Analogously  to  the  approach  followed  in  chapter  4,  let  us  define  ratios  linking  the  various 
token  frequencies  in  order  to  study  the  influence  of  the  token  distributions  among  the 
various  input  edges  of  the  network: 
Let  F|  =  F2  /  mi2  with  mi2  e  Q"^' 
And  F3  =  F2  /  m32  with  m32  e  Q^* 
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Where  Q^*  is  the  set  of  positive,  non-null  rational  numbers. 
AC"  can  now  be  rewritten  as: 

[F2  -  (—  *  So,  *  Sj,'  +  F2  *  Sj2  +  —  *  So,  *  Sj,')]  *  C02 
Factoring  out  F2 

F2  *  [1  Sj2  J  *  C02 

mi2  m32 
And  reducing  to  a  common  denominator 

„  .  rmi2  *m32 -mn  *Soi  *Sji'-mi2  *S<jj*Sji'  „ 

r2*[  Sj2]*Ca2 

mi2  *  m32 

The  condition  for  the  pushdown  is  0  <  AC*'  <  AC". 

Replacing  AC"^  and  AC"  by  their  respective  correpsonding  expressions: 


F2  *  (1  -  S02)  *  [Cj2,  +  Sj2,*  Q23  +  Ia2]  +  —  *  Sa,*  ACj.2  +  —  *  Sa3*  ACj32  < 

mi2  m32 


_  . -mi2  *m32 -m32  *Soi  *Sjr-mi2  *So3*Sj3'    „  , 

F2*[  Sj2] 

m,2  *  m32 


Which  after  dividing  by  F2  on  both  sides  can  be  simplified  into: 


/I      o     \  *  r>-<           „      J.  „           _    ,        Soi  *  ACjl2        Sa3  *  ACj32 
(1  -  S02)  *  [Cj21  +  Sj21*  Cj23  +  lai]  +   —  +   —  < 

ran  m32 


mi2  *  m32  -  m32  *  Soi  *  Sji'- mi2  *  Sa3  *  Sj3' 

I  Sj2] 

mi2  *  m32 


Finally  dividing  each  side  of  the  inequality  above  by  (1  -  802),  and  rearranging  terms  we 
get 


123 


With  the  quantities  Qi  and  Q2  being  equal  to   — — ^ —  and 


mi2*(l-SoO 


m32*(l-Sa2) 


respectively. 

The  table  below  recapitulates  the  results  obtained  so  far.  The  rank  based  decision  is  the 
one  obtained  earlier  in  inequality  (4)  and  the  cost  formula  based  approach  is  taken  from 
the  inequality  above. 

Table  5  Recapitulation  of  the  conditions  for  pushing  down  02 


PUSHDOWN 

Rank  Based 

1  -  Sr 

Ci/02  <           ...  *  (2a2 
1-Sa2 

Cost  formula 
Based 

mi2  *ni32  -m32  *Soi  *Sji'-mi2  *Sa3*Sj3'  ^ 

Cj,  +  Sj2,*Cj2<                     m,2*m,2                                     Qi  Q2 

1  -  S02                                   mi2  m32 

While  Cj/a2  and  (Q,  +  Sjzi  *  Qj)  might  seem  to  be  different  at  first  sight,  a  more  careful 
examination  of  the  cost  incurred  when  a  token  is  inserted  into  node  clearly  shows  that 
they  represent  the  same  quantities.  Indeed,  for  each  token  inserted  in  a2,  cost  Q,  is 
incurred  once  during  the  join  with  tti,  then  the  resulting  average  of  Sjzi  tokens  is  joined 
with  as,  incurring  a  cost  of  Sj2i  *  Q2. 

The  function  defined  as  f(m)  in  chapter  4  becomes  in  this  case  f(m,2,  m32). 


124 


mi2  *  m?:  —  mi2  *  Soi  *  Sji '  — 


m,2  *So,*Sj,' 


-  S  r. 


Let  f(mi2,mi:)  = 


mi2  *  mi2 


1-S 


mi2  ni32 


With  Sj,',  Sj2,  Sjj',  Qi  and  Q2  6  R^*  and  Soi,  S02  g  (0,1) 


The  properties  (limits,  continuity  and  variations)  of  f(mi2,  m32)  with  respect  to  the 
variables  niu  and  are  similar  to  those  of  the  function  f(m)  defined  in  chapter  4.  The 
same  conclusions  can  therefore  be  reached  in  the  cases  where  beta  nodes  have  multiple 
children:  The  variations  in  the  bias  of  the  distributions  of  token  frequencies  will  cause 
erroneous  decisions  if  based  on  the  rank  metric. 
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